On Wed, Jan 24, 2018 at 11:03:47PM +0100, Ævar Arnfjörð Bjarmason wrote:

> If you have a bunch of git repositories cloned off the same project on
> the same filesystem, it would be nice if the packs that are produced
> were friendly to block-level deduplication.
>
> This would save space, and the blocks would be more likely to be in
> cache when you access them, likely speeding up git operations even if
> the packing itself is less efficient.
>
> Here's a hacky one-liner that clones git/git and peff/git (almost the
> same content), md5sums each 4k packed block, and sort | uniq -c's
> them to see how many are the same:
>
>     (
>         cd /tmp &&
>         rm -rf git*;
>         git clone --reference ~/g/git --dissociate git@xxxxxxxxxx:git/git.git git1 &&
>         git clone --reference ~/g/git --dissociate git@xxxxxxxxxx:peff/git.git git2 &&
>         for repo in git1 git2
>         do
>             (
>                 cd $repo &&
>                 git repack -A -d --max-pack-size=10m
>             )
>         done &&
>         parallel "perl -MDigest::MD5=md5_hex -wE 'open my \$fh, q[<], shift; my \$s; while (read \$fh, \$s, 2**12) { say md5_hex(\$s) }' {}" ::: \
>             $(find /tmp/git*/.git/objects/pack -type f)|sort|uniq -c|sort -nr|awk '{print $1}'|sort|uniq -c|sort -nr
>     )
>
> This produces a total of 0 blocks that are the same. If after the repack
> we throw this in there:
>
>     echo 5be1f00a9a | git pack-objects --no-reuse-delta --no-reuse-object --revs .git/objects/pack/manual
>
> Just over 8% of the blocks are the same, but of course this pack
> entirely duplicates the existing packs, and I don't know how to coerce
> repack/pack-objects into keeping this manual-* pack and re-packing the
> rest, removing any objects that exist in the manual-* pack.
>
> Documentation/technical/pack-heuristics.txt goes over some of the ideas
> behind the algorithm, and Junio's 1b4bb16b9e ("pack-objects: optimize
> "recency order"", 2011-06-30) seems to be the last major tweak to it.
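The 4k-block comparison that the perl one-liner performs can be sketched in Python; the synthetic byte strings below stand in for real packfiles, and the block size simply mirrors the 2**12 reads above:

```python
import hashlib

BLOCK = 4096  # 4k blocks, matching the 2**12 reads in the perl one-liner

def block_hashes(data):
    # MD5 of each fixed-size block; dedup-friendly packs would maximize
    # the number of hashes shared between repositories.
    return {hashlib.md5(data[i:i + BLOCK]).hexdigest()
            for i in range(0, len(data), BLOCK)}

# Two synthetic "packs": a common prefix, then divergent tails.
common = bytes(i % 251 for i in range(2 * BLOCK))
pack1 = common + b"\x01" * (2 * BLOCK)
pack2 = common + b"\x02" * (2 * BLOCK)

# Only the blocks from the common prefix hash identically.
shared = block_hashes(pack1) & block_hashes(pack2)
print(len(shared), "blocks shared")
```

Note that block alignment is everything here: shifting one pack's content by a single byte destroys all sharing, which is why two independently deltified packs of near-identical history dedupe to zero blocks.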
> I couldn't find any references to someone trying to get this particular
> use-case working on-list, i.e. packing different repositories with a
> shared history in such a way as to maximize the number of identical
> blocks across packs.
>
> It should be possible to produce such a pack, e.g. by having a repack
> mode that would:
>
>  1. Find what the main branch is.
>  2. Walk its commits in reverse order, producing packs from batches of
>     commits of some chunk-size.
>  3. Pack all the remaining content.
>
> This would delta much less efficiently, but as noted above the
> block-level deduplication might make up for it, and in any case some
> might want to use less disk space.
>
> Has anyone here barked up this tree before? Suggestions? Tips on where
> to start hacking the repack code to accomplish this would be most
> welcome.

FWIW, I sidestep the problem entirely by using alternates.

	Mike
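The batched-repack idea in the quoted proposal might look roughly like the sketch below, run against a throwaway repository; the two-commit batch size and the "batch" pack base name are arbitrary choices for illustration:

```shell
set -e
tmp=$(mktemp -d) && cd "$tmp"
git -c init.defaultBranch=main init -q demo && cd demo
for n in 1 2 3 4; do
    git -c user.email=a@example.com -c user.name=A \
        commit -q --allow-empty -m "c$n"
done

# Walk history oldest-first and cut a pack every $batch commits; each
# pack covers a fixed commit range, so another repository sharing this
# history should produce very similar batch packs.
batch=2
i=0
prev=
for commit in $(git rev-list --reverse HEAD); do
    i=$((i + 1))
    if [ $((i % batch)) -eq 0 ]; then
        {
            echo "$commit"
            if [ -n "$prev" ]; then echo "^$prev"; fi
        } | git pack-objects -q --revs .git/objects/pack/batch >/dev/null
        prev=$commit
    fi
done
ls .git/objects/pack/
```

This only produces the batch packs; it does not solve the problem raised above of teaching repack to keep them while repacking everything else.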
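The alternates mechanism Mike mentions can be demonstrated with `git clone --shared`, which records the source object store in `objects/info/alternates` instead of copying packs at all (throwaway paths, illustrative only):

```shell
set -e
tmp=$(mktemp -d) && cd "$tmp"
git -c init.defaultBranch=main init -q upstream
git -C upstream -c user.email=a@example.com -c user.name=A \
    commit -q --allow-empty -m initial

# --shared writes upstream's object directory into
# borrower/.git/objects/info/alternates, so the clone borrows objects
# from upstream instead of duplicating them on disk.
git clone -q --shared upstream borrower

cat borrower/.git/objects/info/alternates
```

Unlike dedup-friendly packs, this only helps when the repositories live on the same filesystem, and the source repository must never prune objects the borrowers still need.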