On Wed, Jan 24, 2018 at 11:03:47PM +0100, Ævar Arnfjörð Bjarmason wrote:

> If you have a bunch of git repositories cloned off the same project on
> the same filesystem, it would be nice if the packs that are produced
> were friendly to block-level deduplication.
>
> This would save space, and the blocks would be more likely to be in
> cache when you access them, likely speeding up git operations even if
> the packing itself is less efficient.
>
> Here's a hacky one-liner that clones git/git and peff/git (almost the
> same content), md5sums each 4k packed block, and sort | uniq -c's
> them to see how many are the same:
>
>     (
>         cd /tmp &&
>         rm -rf git*;
>         git clone --reference ~/g/git --dissociate git@xxxxxxxxxx:git/git.git git1 &&
>         git clone --reference ~/g/git --dissociate git@xxxxxxxxxx:peff/git.git git2 &&
>         for repo in git1 git2
>         do
>             (
>                 cd $repo &&
>                 git repack -A -d --max-pack-size=10m
>             )
>         done &&
>         parallel "perl -MDigest::MD5=md5_hex -wE 'open my \$fh, q[<], shift; my \$s; while (read \$fh, \$s, 2**12) { say md5_hex(\$s) }' {}" ::: \
>             $(find /tmp/git*/.git/objects/pack -type f)|sort|uniq -c|sort -nr|awk '{print $1}'|sort|uniq -c|sort -nr
>     )
>
> This produces a total of 0 blocks that are the same. If after the repack
> we throw this in there:
>
>     echo 5be1f00a9a | git pack-objects --no-reuse-delta --no-reuse-object --revs .git/objects/pack/manual
>
> Just over 8% of the blocks are the same, but of course this pack
> entirely duplicates the existing packs, and I don't know how to coerce
> repack/pack-objects into keeping this manual-* pack and re-packing the
> rest, removing any objects that exist in the manual-* pack.
>
> Documentation/technical/pack-heuristics.txt goes over some of the ideas
> behind the algorithm, and Junio's 1b4bb16b9e ("pack-objects: optimize
> "recency order"", 2011-06-30) seems to be the last major tweak to it.
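The 4k-block comparison that the perl one-liner performs can be sketched in Python; the synthetic byte strings below stand in for real packfiles, and the block size simply mirrors the 2**12 reads above:

```python
import hashlib

BLOCK = 4096  # 4k blocks, matching the 2**12 reads in the perl one-liner

def block_hashes(data):
    # MD5 of each fixed-size block; dedup-friendly packs would maximize
    # the number of hashes shared between repositories.
    return {hashlib.md5(data[i:i + BLOCK]).hexdigest()
            for i in range(0, len(data), BLOCK)}

# Two synthetic "packs": a common prefix, then divergent tails.
common = bytes(i % 251 for i in range(2 * BLOCK))
pack1 = common + b"\x01" * (2 * BLOCK)
pack2 = common + b"\x02" * (2 * BLOCK)

# Only the blocks from the common prefix hash identically.
shared = block_hashes(pack1) & block_hashes(pack2)
print(len(shared), "blocks shared")
```

Note that block alignment is everything here: shifting one pack's content by a single byte destroys all sharing, which is why two independently deltified packs of near-identical history dedupe to zero blocks.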
> I couldn't find any references to someone trying to get this particular
> use-case working on-list, i.e. packing different repositories with a
> shared history in such a way as to maximize the number of identical
> blocks across packs.
>
> It should be possible to produce such a pack, e.g. by having a repack
> mode that would:
>
>  1. Find what the main branch is.
>  2. Walk its commits in reverse order, producing packs from batches of
>     commits of some chunk-size.
>  3. Pack all the remaining content.
>
> This would delta much less efficiently, but as noted above the
> block-level deduplication might make up for it, and in any case some
> might want to use less disk space.
>
> Has anyone here barked up this tree before? Suggestions? Tips on where
> to start hacking the repack code to accomplish this would be most
> welcome.

FWIW, I sidestep the problem entirely by using alternates.

	Mike
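The batched-repack idea in the quoted proposal might look roughly like the sketch below, run against a throwaway repository; the two-commit batch size and the "batch" pack base name are arbitrary choices for illustration:

```shell
set -e
tmp=$(mktemp -d) && cd "$tmp"
git -c init.defaultBranch=main init -q demo && cd demo
for n in 1 2 3 4; do
    git -c user.email=a@example.com -c user.name=A \
        commit -q --allow-empty -m "c$n"
done

# Walk history oldest-first and cut a pack every $batch commits; each
# pack covers a fixed commit range, so another repository sharing this
# history should produce very similar batch packs.
batch=2
i=0
prev=
for commit in $(git rev-list --reverse HEAD); do
    i=$((i + 1))
    if [ $((i % batch)) -eq 0 ]; then
        {
            echo "$commit"
            if [ -n "$prev" ]; then echo "^$prev"; fi
        } | git pack-objects -q --revs .git/objects/pack/batch >/dev/null
        prev=$commit
    fi
done
ls .git/objects/pack/
```

This only produces the batch packs; it does not solve the problem raised above of teaching repack to keep them while repacking everything else.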
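The alternates mechanism Mike mentions can be demonstrated with `git clone --shared`, which records the source object store in `objects/info/alternates` instead of copying packs at all (throwaway paths, illustrative only):

```shell
set -e
tmp=$(mktemp -d) && cd "$tmp"
git -c init.defaultBranch=main init -q upstream
git -C upstream -c user.email=a@example.com -c user.name=A \
    commit -q --allow-empty -m initial

# --shared writes upstream's object directory into
# borrower/.git/objects/info/alternates, so the clone borrows objects
# from upstream instead of duplicating them on disk.
git clone -q --shared upstream borrower

cat borrower/.git/objects/info/alternates
```

Unlike dedup-friendly packs, this only helps when the repositories live on the same filesystem, and the source repository must never prune objects the borrowers still need.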