thin packs ending up fat

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



[This ended up long; the gist of it is that we often fail to find thin
 deltas, making small fetches much bigger than they could be. Read on
 for details].

Recently I was doing some work related to bundles containing thin packs,
and I wanted to generate some thin packs. Much to my surprise, all of my
packs ended up having no thin objects!

It turns out that when packing a subset of a fully packed repo (as we do
for a bundle or for a fetch), we tend not to make thin packs at all.
The culprit is this logic in try_delta:

        /*
         * We do not bother to try a delta that we discarded
         * on an earlier try, but only when reusing delta data.
         */
        if (reuse_delta && trg_entry->in_pack &&
            trg_entry->in_pack == src_entry->in_pack &&
            trg_entry->in_pack_type != OBJ_REF_DELTA &&
            trg_entry->in_pack_type != OBJ_OFS_DELTA)
                return 0;

This does not try to delta objects which are found in the same pack,
under the assumption that we would have tried them last time and found
that they didn't work. For a complete repack, it works well. We don't
miss useful deltas, and it saves a lot of CPU time.

For a thin pack, though, it means we'll typically fail to actually use
the boundary objects. I think what happens is this:

  1. You start at commit C which contains a file with content A. You
     build commit C', which contains content A'.

  2. You repack the repository, putting A and A' in the pack. Our delta
     rules tend to create deltas in reverse-recency order, so A
     typically is stored as a delta based on A', and A' is stored
     whole.

  3. You try to create a thin pack (e.g., with "git bundle create
     foo.bundle C..C'"). We know that we want to send A', and that we
     can assume the other side has A to use as a base in our thin pack.

     But the logic above says "A and A' are both in the same pack, and
     we didn't find a delta for A'. They must not be a good match". But
     in this case it is because our deltas go in the reverse direction,
     and we never tried them.

So I did a little experimenting (scripts and inputs are at the end of
the email). I generated a thin pack for a subset of objects from my
fully packed repo, measuring the size of the resulting pack and the CPU
time used. I did this both with stock git, and a version of git with the
above optimization removed.

I used two datasets to generate the input to pack-objects. In the first,
I used the reflog entries from my refs/remotes/origin/master ref. That
simulates the workload of upload-packs performed for me by kernel.org
over the last month or so. In the second dataset, I took packed the
objects between git v1.0.0 and v1.1.0, between v1.1.0 and v1.2.0, and so
on. This was an attempt to simulate somebody who fetches much less
frequently, or perhaps somebody who generates bundles to sneaker-net
individual versions.

For each dataset, I calculated the average size (in bytes) over all of
the generated packs in the dataset and the average CPU time used per
pack (in seconds).

                  dataset
            | fetches | tags
---------------------------------
     before | 53358   | 2750977
size  after | 32315   | 2652939
     change |   -39%  |      -4%
---------------------------------
     before |  0.18   | 1.12
CPU   after |  0.19   | 1.50
     change |    +5%  |     +33%

For the case of small, frequent packs, we saved a lot of space for a
little bit of extra CPU. But for larger packs, our savings were much
more limited, and we spent a lot more CPU time.  That makes sense; this
is really beneficial for finding extra matches at the boundaries, and
for cases where you can avoid sending an entire undeltified object at
all for a given path.

So I think this is worth pursuing as an optimization to save bandwidth
on transfers, but it would require figuring out the right times to turn
it on. My initial idea was that we would want it off whenever we were
doing a thin pack. But the "tags" dataset shows that it's probably not
worth the trouble for large packs due to the extra CPU time.

I have a feeling that the case it is generally catching is this "reverse
delta" case. I.e., A is based off of B, we are including B, but A is
available as a thin base. But that can probably be extended into a delta
chain (A is based from B which is based on C; we want to include B and
C, but can build off of A as a base. We'll keep the delta from B to C,
but we would want to delta C off of A). So detecting that they are part
of the same chain might require walking the chains.

Maybe it is enough to simply turn off this optimization if the potential
delta source is not being included in the pack (i.e., we are using
--thin and it is a boundary object). Because if both objects are being
sent, we will just end up reusing the delta that goes in the reverse
direction anyway.

Or maybe there is some other clever way of detecting this situation. I'm
open to suggestions.

-Peff

---
This is the test script I used (feed the dataset files on stdin, get
lines of sizes and CPU times on stdout):

-- >8 --
#!/bin/sh

while read revs; do
	for i in $revs; do
		echo $i
	done >input
	echo >&2 "==> $revs"
	time -o time.out -f "%e" \
		git pack-objects --thin --revs --stdout <input >pack.out
	echo "`stat --format=%s pack.out` `cat time.out`"
done
-- 8< --

and the "fetch" dataset:

-- >8 --
8b0e15fa95e11965f18c8d2585dc8ffd9bfc9356 ^7f41b6bbe3181dc4d1687db036bf22316997a1bf
34c4461ae3b353e8c931565d5527b98ed12e3735 ^8b0e15fa95e11965f18c8d2585dc8ffd9bfc9356
463b0ea22b5b9a882e8140d0308433d8cbd0d1fe ^34c4461ae3b353e8c931565d5527b98ed12e3735
288396994f077eec7e7db0017838a5afbfbf81e3 ^463b0ea22b5b9a882e8140d0308433d8cbd0d1fe
05f6edcd2a418a88eeb953d51408a6aeef312f5c ^288396994f077eec7e7db0017838a5afbfbf81e3
08cfdbb88cd6225b4fc4b8a3cecd0e01758c835d ^05f6edcd2a418a88eeb953d51408a6aeef312f5c
87009edcbd0b4987ccb7ba050a1efe368a315753 ^08cfdbb88cd6225b4fc4b8a3cecd0e01758c835d
8963314c77af9a4eda5dcbdbab3d4001af83ad81 ^87009edcbd0b4987ccb7ba050a1efe368a315753
10b2a48113b8ab6b8f48229eb40fc3637ce025ae ^8963314c77af9a4eda5dcbdbab3d4001af83ad81
997a1946a55cafb992c4ba8e5e0795aa73f5a4a9 ^10b2a48113b8ab6b8f48229eb40fc3637ce025ae
e8e1c29021da446d0c50573ef9619bf74f515c20 ^997a1946a55cafb992c4ba8e5e0795aa73f5a4a9
87bf9a7048c623b3567f612ca3e67a0d412fc83d ^e8e1c29021da446d0c50573ef9619bf74f515c20
ee6dfb2d83ba1b057943e705f707fa27e34e47f9 ^87bf9a7048c623b3567f612ca3e67a0d412fc83d
4cb6764227173a6483edbdad09121651bc0b01c3 ^ee6dfb2d83ba1b057943e705f707fa27e34e47f9
8a042478967679b0dab3137f5f18367a0ffdc48a ^4cb6764227173a6483edbdad09121651bc0b01c3
248dbbe83256202f0edd6e1468d01cfbe27fd733 ^8a042478967679b0dab3137f5f18367a0ffdc48a
bc1bbe0c19a6ff39522b4fa3259f34150e308e1f ^248dbbe83256202f0edd6e1468d01cfbe27fd733
4d2440fe0daa9ad1556dfd220af8b3a883cf849d ^bc1bbe0c19a6ff39522b4fa3259f34150e308e1f
f56ef114eeefee755f422e6cbef2d83974cb90f1 ^4d2440fe0daa9ad1556dfd220af8b3a883cf849d
e14d63198867c545d0662afc00bf7be048bf2231 ^f56ef114eeefee755f422e6cbef2d83974cb90f1
017d1e134545db0d162908f3538077eaa1f34fb6 ^e14d63198867c545d0662afc00bf7be048bf2231
fc14b89a7e6899b8ac3b5751ec2d8c98203ea4c2 ^017d1e134545db0d162908f3538077eaa1f34fb6
406da7803217998ff6bf5dc69c55b1613556c2f4 ^fc14b89a7e6899b8ac3b5751ec2d8c98203ea4c2
7e02a6c63a183270b726bb21640059ae16fa48ae ^406da7803217998ff6bf5dc69c55b1613556c2f4
4cb5d10b14dcbe0155bed9c45ccb94e83bd4c599 ^7e02a6c63a183270b726bb21640059ae16fa48ae
9859a023fef30ffebdd22ad9639c587ac720b8b6 ^4cb5d10b14dcbe0155bed9c45ccb94e83bd4c599
57526fde5df201a99afa6d122c3266b3a1c5673a ^9859a023fef30ffebdd22ad9639c587ac720b8b6
73c6b3575bc638b7096ec913bd91193707e2265d ^57526fde5df201a99afa6d122c3266b3a1c5673a
10f4eb652ee4e592f91f638e579d1afcb96c0408 ^73c6b3575bc638b7096ec913bd91193707e2265d
ee228024933069b93ce23a1bd5eeb7ae12c792f2 ^10f4eb652ee4e592f91f638e579d1afcb96c0408
d16520499d2652b5b59dfb25f9cf2d56a4c6913a ^ee228024933069b93ce23a1bd5eeb7ae12c792f2
876a6f4991abdd72ea707b193b4f2b831096ad3c ^d16520499d2652b5b59dfb25f9cf2d56a4c6913a
8d68493f20db71abeb77adc251b2a116fe45cdaa ^876a6f4991abdd72ea707b193b4f2b831096ad3c
3daff7c31998faedbe0dd7e2b8651e351be40d64 ^8d68493f20db71abeb77adc251b2a116fe45cdaa
e443bdfe1e8e1ef8b3665cfd1c1295bd73e13773 ^3daff7c31998faedbe0dd7e2b8651e351be40d64
5d6dfc7cb140a6eb90138334fab2245b69bc8bc4 ^e443bdfe1e8e1ef8b3665cfd1c1295bd73e13773
ec330158ec04849fe5ff2cb8749797cd63ae592b ^5d6dfc7cb140a6eb90138334fab2245b69bc8bc4
17b4e93d5b849293e6a3659bbc4075ed8a6e97e2 ^ec330158ec04849fe5ff2cb8749797cd63ae592b
4570aeb0d85f3b5ff274b6d5a651c2ee06d25d76 ^17b4e93d5b849293e6a3659bbc4075ed8a6e97e2
247f9d23da8cfd255533433ad2aa07d172afac0b ^4570aeb0d85f3b5ff274b6d5a651c2ee06d25d76
eac2d83247ea0a265d923518c26873bb12c33778 ^247f9d23da8cfd255533433ad2aa07d172afac0b
beecc7ab65b31c5471331e64acaa3f722125ea67 ^eac2d83247ea0a265d923518c26873bb12c33778
7e521640c80b4bb871bca7a9259621a7abb303e7 ^beecc7ab65b31c5471331e64acaa3f722125ea67
0e1cfc52de002e2d9b0e6562e8672fee3bf45a67 ^7e521640c80b4bb871bca7a9259621a7abb303e7
-- 8< --

and the "tags" dataset:

-- >8 --
v1.1.0 ^v1.0.0
v1.2.0 ^v1.1.0
v1.3.0 ^v1.2.0
v1.4.0 ^v1.3.0
v1.5.0 ^v1.4.0
v1.6.0 ^v1.5.0
v1.7.0 ^v1.6.0
-- 8< --
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[Index of Archives]     [Linux Kernel Development]     [Gcc Help]     [IETF Annouce]     [DCCP]     [Netdev]     [Networking]     [Security]     [V4L]     [Bugtraq]     [Yosemite]     [MIPS Linux]     [ARM Linux]     [Linux Security]     [Linux RAID]     [Linux SCSI]     [Fedora Users]