On Wed, Jan 24 2018, Elijah Newren jotted:

> On Wed, Jan 24, 2018 at 2:03 PM, Ævar Arnfjörð Bjarmason
> <avarab@xxxxxxxxx> wrote:
>> If you have a bunch of git repositories cloned from the same project
>> on the same filesystem, it would be nice if the packs that are
>> produced would be friendly to block-level deduplication.
>>
>> This would save space, and the blocks would be more likely to be in
>> cache when you access them, likely speeding up git operations even
>> if the packing itself is less efficient.
>>
>> Here's a hacky one-liner that clones git/git and peff/git (almost
>> the same content) and md5sums each 4k packed block, and
>> sort | uniq -c's them to see how many are the same:
>
> <snip>
>
>> Has anyone here barked up this tree before? Suggestions? Tips on
>> where to start hacking the repack code to accomplish this would be
>> most welcome.
>
> Does this overlap with the desire to have resumable clones? I'm
> curious what would happen if you did the same experiment with two
> separate clones of git/git, cloned one right after the other so that
> hopefully the upstream git/git didn't receive any updates between
> your two separate clones. (In other words, how much do packfiles
> differ in practice for different packings of the same data?)

If you clone git/git from GitHub twice in a row you get the exact same
pack, and AFAICT this is true of git in general (but may change
between versions). If you make a local commit to that, copy the
directory, and "git repack -A -d" both copies, you again get the exact
same packs.

If you then make just one local commit to one of the copies (even with
--allow-empty) and repack, you get entirely different packs; in my
test only 2.5% of the blocks remained the same.

Obviously you could pack *that* new content incrementally and keep the
existing pack, but that won't help you with de-duping the initially
cloned data, which is what matters.
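
For reference, something like the rough sketch below reproduces that
last test. This is not the snipped one-liner from my earlier mail; the
github.com/git/git URL, the 4k block size, the throwaway user.name /
user.email, and the GNU-coreutils-specific "split --filter" are all
just assumptions to make it self-contained:

    #!/bin/sh
    # Clone once, copy, add one empty commit to only one copy, repack
    # both, then count how many 4k pack blocks are shared.
    set -e

    tmp=$(mktemp -d)
    cd "$tmp"

    git clone --quiet https://github.com/git/git.git a
    cp -r a b

    # One empty commit in one copy only (identity is a dummy value so
    # the script works without global git config).
    git -C b -c user.name=test -c user.email=test@example.com \
        commit --quiet --allow-empty -m "empty commit"

    # Full repack of both copies.
    git -C a repack -A -d -q
    git -C b repack -A -d -q

    # md5sum every 4k block of every pack; a hash seen more than once
    # is a block that block-level dedup could share.
    for pack in a/.git/objects/pack/*.pack b/.git/objects/pack/*.pack
    do
        split -b 4096 --filter=md5sum "$pack"
    done |
        awk '{print $1}' | sort | uniq -c |
        awk '$1 > 1 {dup += $1} {total += $1}
             END { printf "%d of %d blocks duplicated\n", dup, total }'

With the empty commit left out, nearly every block should show up as
duplicated; with it included, the duplicated fraction collapses to the
few percent mentioned above.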