Re: Git packs friendly to block-level deduplication

On Thu, Jan 25, 2018 at 12:06:59AM +0100, Ævar Arnfjörð Bjarmason wrote:

> >> Has anyone here barked up this tree before? Suggestions? Tips on where
> >> to start hacking the repack code to accomplish this would be most
> >> welcome.
> >
> > Does this overlap with the desire to have resumable clones?  I'm
> > curious what would happen if you did the same experiment with two
> > separate clones of git/git, cloned one right after the other so that
> > hopefully the upstream git/git didn't receive any updates between your
> > two separate clones.  (In other words, how much do packfiles differ in
> > practice for different packings of the same data?)
> 
> If you clone git/git from Github twice in a row you get the exact same
> pack, and AFAICT this is true of git in general (but may change between
> versions).

That's definitely not guaranteed. It _tends_ to be the case over the
short term because we use --threads=1 on the server. But it may differ
if:

  - we repack on the server, which we do based on pushes

  - somebody pushes, even to another fork. The exact results depend
    on the packs in which we find the objects, and a new push may
    duplicate some existing objects but with a different representation
    (e.g., a different delta base).
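To see the short-term determinism in practice, one could compare two
back-to-back clones of an unchanged repository. A rough sketch (the tiny
throwaway repo and paths are invented for illustration; real results on a
busy server depend on repacks and thread count, as above):

```shell
#!/bin/sh
# Sketch: do two clones of the same unchanged repo yield identical packs?
set -e
tmp=$(mktemp -d)
cd "$tmp"

# Build a tiny source repo with a couple of commits.
git init -q src
git -C src -c user.email=a@example.com -c user.name=a \
    commit -q --allow-empty -m one
git -C src -c user.email=a@example.com -c user.name=a \
    commit -q --allow-empty -m two

# --no-local forces a real pack transfer instead of hardlinking objects.
git clone -q --no-local src c1
git clone -q --no-local src c2

# The pack file's name embeds its trailer checksum, so identical names
# mean the two clones received byte-identical packs.
for d in c1 c2; do
  basename "$d"/.git/objects/pack/*.pack
done
```

If the source repo is repacked or receives a push between the two clones,
the names can diverge even though the reachable objects are the same.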

I'm actually interested in adding an etags-like protocol extension that
would work something like this:

  - server says "here's a pack, and its opaque tag is XYZ".

  - on resume, the client asks "can I resume the pack with tag XYZ?"

  - the server then decides if the on-disk state is sufficient for it to
    agree to recreate XYZ (e.g., number and identity of packs). If yes,
    then it resumes. If no, then it says "nope" and the two sides go
    through a normal fetch again.
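The server side of that handshake could be as simple as hashing the on-disk
pack state into the tag. A minimal sketch (file names and the tag scheme are
invented for illustration; a real server would hash whatever state it needs
to guarantee byte-identical recreation):

```shell
#!/bin/sh
# Sketch: derive an opaque resume tag from the number and identity of
# on-disk packs, and refuse to resume once that state changes.
set -e
tmp=$(mktemp -d)
mkdir "$tmp/pack"
touch "$tmp/pack/pack-abc.pack" "$tmp/pack/pack-def.pack"

# make_tag: hash the sorted list of pack names; opaque to the client.
make_tag() {
  ls "$1" | sort | sha1sum | cut -d' ' -f1
}

tag=$(make_tag "$tmp/pack")        # sent alongside the original pack

# Client reconnects; state unchanged, so the server can agree to resume.
if [ "$(make_tag "$tmp/pack")" = "$tag" ]; then echo resume-ok; fi

# A push adds a new pack; the tag no longer matches, so say "nope".
touch "$tmp/pack/pack-123.pack"
if [ "$(make_tag "$tmp/pack")" != "$tag" ]; then echo full-fetch; fi
```

The point the sketch illustrates is that only the server interprets the tag;
a fancier implementation could instead key it to a cached copy of the pack
it already sent.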

The important thing is that the tag is opaque to the client. So a stock
implementation could use the on-disk state to decide. But a server could
choose to cache the packs it sends for a period of time (especially if
the client hangs up before we've sent the whole thing). We already do
this to a limited degree at GitHub in order to efficiently serve
multiple clients simultaneously fetching the same pack (e.g., imagine a
fleet of AWS machines all triggering "git fetch" at once).

I think that's a tangent to what you're looking for in this thread,
though.

-Peff


