On Wed, Jan 24 2018, Elijah Newren jotted:

> On Wed, Jan 24, 2018 at 2:03 PM, Ævar Arnfjörð Bjarmason
> <avarab@xxxxxxxxx> wrote:
>> If you have a bunch of git repositories cloned from the same project
>> on the same filesystem, it would be nice if the packs that are
>> produced would be friendly to block-level deduplication.
>>
>> This would save space, and the blocks would be more likely to be in
>> cache when you access them, likely speeding up git operations even
>> if the packing itself is less efficient.
>>
>> Here's a hacky one-liner that clones git/git and peff/git (almost
>> the same content) and md5sums each 4k packed block, and
>> sort | uniq -c's them to see how many are the same:
>
> <snip>
>
>> Has anyone here barked up this tree before? Suggestions? Tips on
>> where to start hacking the repack code to accomplish this would be
>> most welcome.
>
> Does this overlap with the desire to have resumable clones? I'm
> curious what would happen if you did the same experiment with two
> separate clones of git/git, cloned one right after the other so that
> hopefully the upstream git/git didn't receive any updates between
> your two separate clones. (In other words, how much do packfiles
> differ in practice for different packings of the same data?)

If you clone git/git from GitHub twice in a row you get the exact same
pack, and AFAICT this is true of git in general (but may change
between versions). If you make a local commit to that, copy the
directory, and "git repack -A -d" both copies, you again get the exact
same packs.

If you then make just one local commit to one of the copies (even with
--allow-empty) and repack, you get entirely different packs; in my
test only 2.5% of the blocks remained the same.

Obviously you could pack *that* new content incrementally and keep the
existing pack, but that won't help you with de-duping the initially
cloned data, which is what matters.
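
For reference, something like the rough sketch below reproduces that
last test. This is not the snipped one-liner from my earlier mail; the
github.com/git/git URL, the 4k block size, the throwaway user.name /
user.email, and the GNU-coreutils-specific "split --filter" are all
just assumptions to make it self-contained:

    #!/bin/sh
    # Clone once, copy, add one empty commit to only one copy, repack
    # both, then count how many 4k pack blocks are shared.
    set -e

    tmp=$(mktemp -d)
    cd "$tmp"

    git clone --quiet https://github.com/git/git.git a
    cp -r a b

    # One empty commit in one copy only (identity is a dummy value so
    # the script works without global git config).
    git -C b -c user.name=test -c user.email=test@example.com \
        commit --quiet --allow-empty -m "empty commit"

    # Full repack of both copies.
    git -C a repack -A -d -q
    git -C b repack -A -d -q

    # md5sum every 4k block of every pack; a hash seen more than once
    # is a block that block-level dedup could share.
    for pack in a/.git/objects/pack/*.pack b/.git/objects/pack/*.pack
    do
        split -b 4096 --filter=md5sum "$pack"
    done |
        awk '{print $1}' | sort | uniq -c |
        awk '$1 > 1 {dup += $1} {total += $1}
             END { printf "%d of %d blocks duplicated\n", dup, total }'

With the empty commit left out, nearly every block should show up as
duplicated; with it included, the duplicated fraction collapses to the
few percent mentioned above.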