On Tue, Jul 25, 2017 at 06:06:49PM +1000, Andrew Ardill wrote:

> Let's have a look:
>
> $ git rev-list --objects --all |
>   git cat-file --batch-check='%(objectsize:disk) %(objectsize) %(deltabase) %(rest)'
> 174 262 0000000000000000000000000000000000000000
> 171 260 0000000000000000000000000000000000000000
> 139 212 0000000000000000000000000000000000000000
> 47 36 0000000000000000000000000000000000000000
> 377503831 2310238304 0000000000000000000000000000000000000000 data.txt
> 47 36 0000000000000000000000000000000000000000
> 500182546 3740427683 0000000000000000000000000000000000000000 data.txt
> 47 36 0000000000000000000000000000000000000000
> 447340264 3357717475 0000000000000000000000000000000000000000 data.txt
>
> Yep, all zlib.

OK, that makes sense.

> What do you think is a reasonable config for storing text files this
> large, to get good delta compression, or is it more of a trial and
> error to find out what works best?

I think it would really depend on what's in your repo. If you just have
gigantic text files and no big binaries, and you have enough RAM to do
diffs on the text files, it's not unreasonable to just set
core.bigfilethreshold to something really big and not worry about it.

In general, a diff is going to want memory at least 2x the size of the
file (for the old and new images). And we tend to keep in memory all of
the images for a single tree-diff at one time (so if you touched two
gigantic files in one commit, then "git log -p" is probably going to
peak at having all four before/after images in memory at once).

If you just want deltas but not diffs, you can probably do:

  echo '*.gigantic -diff' >.gitattributes
  git config core.bigfilethreshold 10G

I think that will turn off streaming of the blobs in some code paths,
too. But hopefully a _single_ copy of each file would be OK to hold in
RAM.
If it's not, you might also be able to get away with packing once with:

  git -c core.bigfilethreshold=10G repack -adf

and then further repacks will carry those deltas forward. I think we
only apply the limit when actively searching for new deltas, not when
reusing existing ones.

As you can see, core.bigfilethreshold is a pretty blunt instrument. It
might be nice if .gitattributes understood other types of patterns
besides filenames, so you could do something like:

  echo '[size > 500MB] delta -diff' >.gitattributes

or something like that. I don't think it's come up enough for anybody to
care too much about it or work on it.

-Peff
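
A rough way to check whether a repack like the one above actually
produced deltas is to re-run the batch-check listing from earlier in the
thread and look at the delta base column. The sketch below is only an
illustration: report_deltas is a made-up helper name (not a git
command), it assumes the four-column format used in that listing, and it
assumes paths without spaces. It reads sample lines from a here-doc so
it runs without a repository.

```shell
#!/bin/sh
# Summarize `git cat-file --batch-check` output in the format used earlier
# in this thread: "<disk size> <object size> <delta base> <path>".
# report_deltas is a hypothetical helper name, not part of git.
report_deltas() {
  awk '
    # Skip objects listed without a path (commits and root trees).
    $4 == "" { next }
    {
      # An all-zero delta base means the object is stored whole (zlib only).
      kind = ($3 ~ /^0+$/) ? "whole" : "delta"
      # Print path, storage kind, and on-disk size as a share of full size.
      printf "%s %s %.1f%%\n", $4, kind, 100 * $1 / $2
    }
  '
}

# Sample input: one pathless object plus the three data.txt blobs from
# the listing quoted above.
report_deltas <<'EOF'
47 36 0000000000000000000000000000000000000000
377503831 2310238304 0000000000000000000000000000000000000000 data.txt
500182546 3740427683 0000000000000000000000000000000000000000 data.txt
447340264 3357717475 0000000000000000000000000000000000000000 data.txt
EOF
```

In a real repository you would feed it from the same pipeline instead of
the here-doc (git rev-list --objects --all | git cat-file
--batch-check=... | report_deltas). A blob that still shows "whole"
after the repack was not deltified.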