Re: Should I store large text files on Git LFS?

On Tue, Jul 25, 2017 at 06:06:49PM +1000, Andrew Ardill wrote:

> Let's have a look:
> 
> $ git rev-list --objects --all |
>   git cat-file --batch-check='%(objectsize:disk) %(objectsize) %(deltabase) %(rest)'
> 174 262 0000000000000000000000000000000000000000
> 171 260 0000000000000000000000000000000000000000
> 139 212 0000000000000000000000000000000000000000
> 47 36 0000000000000000000000000000000000000000
> 377503831 2310238304 0000000000000000000000000000000000000000 data.txt
> 47 36 0000000000000000000000000000000000000000
> 500182546 3740427683 0000000000000000000000000000000000000000 data.txt
> 47 36 0000000000000000000000000000000000000000
> 447340264 3357717475 0000000000000000000000000000000000000000 data.txt
> 
> Yep, all zlib.

OK, that makes sense. The all-zeros delta base on every object means none
of those data.txt revisions delta against anything; each multi-gigabyte
version is stored as a whole, zlib-deflated blob.

> What do you think is a reasonable config for storing text files this
> large, to get good delta compression, or is it more of a trial and
> error to find out what works best?

I think it would really depend on what's in your repo. If you just have
gigantic text files and no big binaries, and you have enough RAM to do
diffs on the text files, it's not unreasonable to just set
core.bigfilethreshold to something really big and not worry about it.

In general, a diff is going to want memory at least 2x the size of the
file (for the old and new images). And we tend to keep in memory all of
the images for a single tree-diff at one time (so if you touched two
gigantic files in one commit, then "git log -p" is probably going to
peak at having all four before/after images in memory at once).
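
To put rough numbers on it: the ~3.7GB data.txt revision above would want
around 7.4GB just for its before/after pair, and a commit touching two
files of that size could peak near 15GB during "git log -p".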

If you just want deltas but not diffs, you can probably do:

  echo '*.gigantic -diff' >.gitattributes
  git config core.bigfilethreshold 10G
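
You can sanity-check that the attribute took effect (data.gigantic here is
just a made-up name matching the pattern):

  $ git check-attr diff data.gigantic
  data.gigantic: diff: unset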

I think that will turn off streaming of the blobs in some code paths,
too. But hopefully a _single_ copy of each file would be OK to hold in
RAM. If it's not, you might also be able to get away with packing once
with:

  git -c core.bigfilethreshold=10G repack -adf

and then further repacks will carry those deltas forward. I think we
only apply the limit when actively searching for new deltas, not when
reusing existing ones.
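
You can verify the deltas stuck by re-running the check from above after
the repack and looking for a non-zero delta base on the big blobs:

  $ git rev-list --objects --all |
    git cat-file --batch-check='%(objectsize:disk) %(deltabase) %(rest)'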

As you can see, core.bigfilethreshold is a pretty blunt instrument. It
might be nice if .gitattributes understood other types of patterns
besides filenames, so you could do something like:

  echo '[size > 500MB] delta -diff' >.gitattributes

or something like that. I don't think it's come up enough for anybody to
care too much about it or work on it.
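
As a rough stopgap you could generate per-path entries yourself, e.g. with
GNU find (this won't cope with whitespace or glob characters in filenames):

  $ find . -path ./.git -prune -o -type f -size +500M \
      -printf '%P -diff\n' >>.gitattributes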

-Peff


