On 3/1/2017 11:36, Linus Torvalds wrote:
> On Wed, Mar 1, 2017 at 5:51 AM, Marius Storm-Olsen <mstormo@xxxxxxxxx> wrote:
>> When first importing, I disabled gc to avoid any repacking until completed.
>> When done importing, there was 209GB of all loose objects (~670k files).
>> With the hopes of quick consolidation, I did a
>>     git -c gc.autoDetach=0 -c gc.reflogExpire=0 \
>>         -c gc.reflogExpireUnreachable=0 -c gc.rerereresolved=0 \
>>         -c gc.rerereunresolved=0 -c gc.pruneExpire=now \
>>         gc --prune
>> which brought it down to 206GB in a single pack. I then ran
>>     git repack -a -d -F --window=350 --depth=250
>> which took it down to 203GB, where I'm at right now.
>> Considering that it was 209GB in loose objects, I don't think it
>> delta-packed the big objects at all.
>
> I wonder if the big objects end up hitting some size limit that causes
> the delta creation to fail.
You're likely on to something here.
I just ran

    git verify-pack --verbose \
        objects/pack/pack-9473815bc36d20fbcd38021d7454fbe09f791931.idx \
        | sort -k3n | tail -n15

and got no blobs with deltas in them:
feb35d6dc7af8463e038c71cc3893d163d47c31c blob 36841958 36461935 3259424358
007b65e603cdcec6644ddc25c2a729a394534927 blob 36845345 36462120 3341677889
0727a97f68197c99c63fcdf7254e5867f8512f14 blob 37368646 36983862 3677338718
576ce2e0e7045ee36d0370c2365dc730cb435f40 blob 37399203 37014740 3639613780
7f6e8b22eed5d8348467d9b0180fc4ae01129052 blob 125296632 83609223 5045853543
014b9318d2d969c56d46034a70223554589b3dc4 blob 170113524 6124878 1118227958
22d83cb5240872006c01651eb1166c8db62c62d8 blob 170113524 65941491 1257435955
292ac84f48a3d5c4de8d12bfb2905e055f9a33b1 blob 170113524 67770601 1323377446
2b9329277e379dfbdcd0b452b39c6b0bf3549005 blob 170113524 7656690 1110571268
37517efb4818a15ad7bba79b515170b3ee18063b blob 170113524 133083119 1124352836
55a4a70500eb3b99735677d0025f33b1bb78624a blob 170113524 6592386 1398975989
e669421ea5bf2e733d5bf10cf505904d168de749 blob 170113524 7827942 1391148047
e9916da851962265a9d5b099e72f60659a74c144 blob 170113524 73514361 966299538
f7bf1313752deb1bae592cc7fc54289aea87ff19 blob 170113524 70756581 1039814687
8afc6f2a51f0fa1cc4b03b8d10c70599866804ad blob 248959314 237612609 606692699
In fact, I don't see a single "deltified" blob until the 6355th-last line!
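For what it's worth, a rough way to count how many blobs actually got deltified (just a sketch, assuming the documented verify-pack output, where deltified entries carry two extra columns for delta depth and base SHA-1):

    git verify-pack --verbose \
        objects/pack/pack-9473815bc36d20fbcd38021d7454fbe09f791931.idx \
        | awk '$2 == "blob" && NF == 7' | wc -l

Non-delta entries only have five columns, so the field count is a cheap way to tell the two apart.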
> For example, we have that HASH_LIMIT that limits how many hashes
> we'll create for the same hash bucket, because there's some quadratic
> behavior in the delta algorithm. It triggered with things like big
> files that have lots of repeated content.
>
> We also have various memory limits, in particular
> 'window_memory_limit'. That one should default to 0, but maybe you
> limited it at some point in a config file and forgot about it?
Indeed, I did pass

    -c pack.threads=20 --window-memory=6g

to 'git repack', since the machine is a 20-core (40-thread) machine
with 126GB of RAM.

So I guess that with objects of this size, even 6GB per thread isn't
enough to get a big enough window for proper delta-packing?
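To rule out a leftover limit hiding in a config file (rather than on the command line), something like this should show where any pack window settings come from; a sketch, assuming only the standard pack.window / pack.windowMemory / pack.depth keys are in play:

    git config --show-origin --get-all pack.windowMemory
    git config --show-origin --get-all pack.window
    git config --show-origin --get-all pack.depth

If pack.windowMemory shows up anywhere, setting it to 0 (or dropping --window-memory) should remove the per-thread cap entirely.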
This repo took >14hr to repack on 20 threads, though (the "compression"
step was very fast, but it was stuck in "writing objects" 95% of the
time), so I can only imagine how long a pack.threads=1 run will take :)
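If I do try a single-threaded run, it would presumably be something along these lines (same flags as before, just forcing one thread and timing it; a sketch, not something I've run yet):

    time git -c pack.threads=1 repack -a -d -F --window=350 --depth=250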
But aren't the blobs sorted by some metric for reasonable delta-pack
locality, so that even with a 6GB window it should have seen ~25 similar
objects to deltify against?
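One more thing I could check is the delta-chain summary that verify-pack prints after the object listing; assuming the usual "chain length = N: M objects" / "non delta: N objects" trailer lines, something like this should show how (un)deltified the pack really is:

    git verify-pack --verbose \
        objects/pack/pack-9473815bc36d20fbcd38021d7454fbe09f791931.idx \
        | grep -E '^(non delta|chain length)'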
--
.marius