Re: Git packs friendly to block-level deduplication

On Wed, Jan 24 2018, Elijah Newren jotted:

> On Wed, Jan 24, 2018 at 2:03 PM, Ævar Arnfjörð Bjarmason
> <avarab@xxxxxxxxx> wrote:
>> If you have a bunch of git repositories cloned of the same project on
>> the same filesystem, it would be nice if the packs that are produced
>> would be friendly to block-level deduplication.
>>
>> This would save space, and the blocks would be more likely to be in
>> cache when you access them, likely speeding up git operations even if
>> the packing itself is less efficient.
>>
>> Here's a hacky one-liner that clones git/git and peff/git (almost the
>> same content) and md5sums each 4k packed block, and sort | uniq -c's
>> them to see how many are the same:
>
> <snip>
>
>>
>> Has anyone here barked up this tree before? Suggestions? Tips on where
>> to start hacking the repack code to accomplish this would be most
>> welcome.
>
> Does this overlap with the desire to have resumable clones?  I'm
> curious what would happen if you did the same experiment with two
> separate clones of git/git, cloned one right after the other so that
> hopefully the upstream git/git didn't receive any updates between your
> two separate clones.  (In other words, how much do packfiles differ in
> practice for different packings of the same data?)

If you clone git/git from GitHub twice in a row you get the exact same
pack, and AFAICT this is true of git in general (but may change between
versions).

If you make a local commit to that, copy the dir, and repack -A -d you
get the exact same packs again.

If you then make just one local commit to one copy (even with
--allow-empty) and repack, you get entirely different packs; in my test
only 2.5% of the blocks remained the same.
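
For anyone who wants to repeat that kind of measurement, here is a
rough Python sketch of the comparison. It is not the one-liner that got
snipped above, just an approximation of the same idea: md5 each 4k
block of two files and see what fraction of blocks in the first also
turn up in the second. The function names are made up for this
example, and the pack paths are whatever your two copies happen to
contain.

```python
import hashlib
from collections import Counter

BLOCK_SIZE = 4096  # 4k blocks, matching the experiment described in the thread


def block_digests(path, block_size=BLOCK_SIZE):
    """Return a Counter of MD5 digests, one per fixed-size block of the file.

    The final block may be shorter than block_size; it is hashed as-is.
    """
    counts = Counter()
    with open(path, "rb") as f:
        while True:
            block = f.read(block_size)
            if not block:
                break
            counts[hashlib.md5(block).hexdigest()] += 1
    return counts


def shared_block_fraction(path_a, path_b, block_size=BLOCK_SIZE):
    """Fraction of the blocks in path_a whose digest also occurs in path_b."""
    a = block_digests(path_a, block_size)
    b = block_digests(path_b, block_size)
    total = sum(a.values())
    if total == 0:
        return 0.0
    shared = sum(n for digest, n in a.items() if digest in b)
    return shared / total
```

Calling shared_block_fraction() on the .pack file from each copy gives
a number between 0.0 and 1.0. Note that it only compares blocks at
fixed 4k offsets, so identical data that sits at a different alignment
in the two packs counts as different.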

Obviously you could pack *that* new content incrementally and keep the
existing pack, but that won't help you with de-duping the initially
cloned data, which is what matters.


