Re: Delta compression not so effective

On 3/1/2017 11:36, Linus Torvalds wrote:
> On Wed, Mar 1, 2017 at 5:51 AM, Marius Storm-Olsen <mstormo@xxxxxxxxx> wrote:
>
>> When first importing, I disabled gc to avoid any repacking until completed.
>> When done importing, there was 209GB of all loose objects (~670k files).
>> With the hopes of quick consolidation, I did a
>>     git -c gc.autoDetach=0 -c gc.reflogExpire=0 \
>>           -c gc.reflogExpireUnreachable=0 -c gc.rerereresolved=0 \
>>           -c gc.rerereunresolved=0 -c gc.pruneExpire=now \
>>           gc --prune
>> which brought it down to 206GB in a single pack. I then ran
>>     git repack -a -d -F --window=350 --depth=250
>> which took it down to 203GB, where I'm at right now.

> Considering that it was 209GB in loose objects, I don't think it
> delta-packed the big objects at all.
>
> I wonder if the big objects end up hitting some size limit that causes
> the delta creation to fail.

You're likely on to something here.
I just ran
    git verify-pack --verbose objects/pack/pack-9473815bc36d20fbcd38021d7454fbe09f791931.idx | sort -k3n | tail -n15
and got no blobs with deltas in them.
feb35d6dc7af8463e038c71cc3893d163d47c31c blob 36841958 36461935 3259424358
007b65e603cdcec6644ddc25c2a729a394534927 blob 36845345 36462120 3341677889
0727a97f68197c99c63fcdf7254e5867f8512f14 blob 37368646 36983862 3677338718
576ce2e0e7045ee36d0370c2365dc730cb435f40 blob 37399203 37014740 3639613780
7f6e8b22eed5d8348467d9b0180fc4ae01129052 blob 125296632 83609223 5045853543
014b9318d2d969c56d46034a70223554589b3dc4 blob 170113524 6124878 1118227958
22d83cb5240872006c01651eb1166c8db62c62d8 blob 170113524 65941491 1257435955
292ac84f48a3d5c4de8d12bfb2905e055f9a33b1 blob 170113524 67770601 1323377446
2b9329277e379dfbdcd0b452b39c6b0bf3549005 blob 170113524 7656690 1110571268
37517efb4818a15ad7bba79b515170b3ee18063b blob 170113524 133083119 1124352836
55a4a70500eb3b99735677d0025f33b1bb78624a blob 170113524 6592386 1398975989
e669421ea5bf2e733d5bf10cf505904d168de749 blob 170113524 7827942 1391148047
e9916da851962265a9d5b099e72f60659a74c144 blob 170113524 73514361 966299538
f7bf1313752deb1bae592cc7fc54289aea87ff19 blob 170113524 70756581 1039814687
8afc6f2a51f0fa1cc4b03b8d10c70599866804ad blob 248959314 237612609 606692699

In fact, I don't see a single "deltified" blob until the 6355th line from the end!
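
To get a count across the whole pack rather than just the tail, something like this should do it, since deltified entries in the verify-pack listing carry two extra columns (delta depth and base SHA-1):

    git verify-pack --verbose objects/pack/pack-9473815bc36d20fbcd38021d7454fbe09f791931.idx \
        | awk '$2 == "blob" { if (NF >= 7) delta++; else plain++ }
               END { printf "%d deltified, %d non-deltified blobs\n", delta, plain }'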


> For example, we have that HASH_LIMIT that limits how many hashes
> we'll create for the same hash bucket, because there's some quadratic
> behavior in the delta algorithm. It triggered with things like big
> files that have lots of repeated content.
>
> We also have various memory limits, in particular
> 'window_memory_limit'. That one should default to 0, but maybe you
> limited it at some point in a config file and forgot about it?

Indeed, I did pass
    -c pack.threads=20 --window-memory=6g
to 'git repack', since it's a 20-core (40-thread) machine with 126GB of RAM.
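
Just to rule out a forgotten config entry, a quick sanity check along these lines should show whether any relevant limit is set and where it comes from (core.bigFileThreshold in particular disables deltas entirely for blobs above it, though its 512m default is well above these blob sizes):

    git config --show-origin --get-all pack.windowMemory
    git config --show-origin --get-all pack.window
    git config --show-origin --get-all core.bigFileThreshold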

So I guess that with objects of this size, even 6GB per thread isn't enough to get a large enough window for proper delta-packing?

This repo took >14hr to repack on 20 threads though (the "compression" step was very fast, but it was stuck ~95% of the time in "writing objects"), so I can only imagine how long pack.threads=1 would take :)

But aren't the blobs sorted by some metric for reasonable delta-pack locality, so that even with a 6GB window it should have seen ~25 similar objects to deltify against?
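
If the 6g cap is what's biting, the rough math would explain it: a full 350-entry window over ~170MB blobs needs on the order of 350 x 170MB, i.e. close to 60GB per thread, so a 6GB cap leaves those blobs with only a small slice of the window. A possible next experiment (just a sketch, trading thread count for per-thread window memory so the total stays under the 126GB here):

    # fewer threads, but a much larger delta-window memory budget per thread
    git -c pack.threads=4 repack -a -d -F --window=350 --depth=250 --window-memory=30g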


--
.marius


