Re: Resumable clone

Johannes Schindelin <Johannes.Schindelin@xxxxxx> writes:

> First of all: my main gripe with the discussed approach is that it uses
> bundles. I know, I introduced bundles, but they just seem too klunky and
> too static for the resumable clone feature.

We should make the mechanism extensible so that we can later support
multiple "alternate resource" formats, and "bundle" could be one of
the options; my current thinking, though, is that the initial version
should use just a bare packfile to bootstrap, not a bundle.
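
(Purely as an illustration of why a static, pre-generated resource is
attractive: a client could restart an interrupted download of such a
packfile with nothing fancier than HTTP range requests.  The sketch
below is hypothetical Python, the URL is made up, and it assumes the
server honors Range; it is not a proposal for the actual protocol.)

    import os
    import urllib.request

    def resume_download(url, path):
        # How much of the packfile did an earlier, interrupted
        # attempt leave behind?
        have = os.path.getsize(path) if os.path.exists(path) else 0
        req = urllib.request.Request(url)
        if have:
            # Ask only for the remainder; any HTTP server or CDN that
            # supports range requests can serve a static file this way.
            req.add_header("Range", "bytes=%d-" % have)
        with urllib.request.urlopen(req) as resp:
            # 206 means the server honored the range; otherwise we
            # have to start over from the beginning.
            mode = "ab" if have and resp.status == 206 else "wb"
            with open(path, mode) as out:
                while True:
                    chunk = resp.read(1 << 16)
                    if not chunk:
                        break
                    out.write(chunk)

    # e.g. resume_download("https://cdn.example.com/repo/clone.pack",
    #                      "clone.pack"); once it completes, the client
    # would run "git index-pack" on the result and then fetch the rest
    # of the history over the normal protocol.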

The format being "static" is both a feature and a practical
compromise.  It is a feature in that it allows clone traffic, which
is a significant portion of the whole traffic to a busy hosting
site, to be diverted off of the core server network, saving both
networking and CPU cost.  And that benefit will be felt even if the
client has a good enough connection to the server that it does not
have to worry about resuming.  It is a practical compromise in that
the mechanism will not be extensible to help incremental fetches.
But I heard that server-side statistics tell us there aren't many
"duplicate incremental fetch" requests (i.e. many clients having the
same set of "have"s, so that the server side could prepare, serve,
and cache the same incremental pack and serve it over a resumable
transport, helping resuming clients with partial/range requests), so
I do not think it is practical to try to use the same mechanism to
help both incremental and clone traffic.  One size would not fit
both here.

I think a better approach to help incremental fetches is along the
line of what was discussed in the discussion with Al Viro and others
the other day.  You'd need various building blocks implemented anew,
including:

 - A protocol extension to allow the client to tell the server a
   list of "not necessarily connected" objects that it has, so that
   the server side can exclude them from the set of objects the
   traditional "have"-"ack" exchange would determine to be sent when
   building a pack.

   - A design of deciding what "list of objects" is worth sending to
     the server side.  The total number of objects on the receiving
     end is an obvious upper bound, and it might be sufficient to
     send the whole thing as-is, but there may be a more efficient
     way to determine this set [*1*].

 - A way to salvage objects from a truncated pack, as there is no
   such tool in core-git (a rough sketch of the idea appears right
   after this list).
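
As a rough illustration of that salvage step (this is not an existing
git tool; it is a minimal sketch that only walks non-delta entries of
a version-2 pack, stops at the first sign of truncation or corruption,
and hands recovered objects to "git hash-object -w"):

    import subprocess
    import sys
    import zlib

    # Non-delta pack object types; 6 (ofs-delta) and 7 (ref-delta)
    # need a base object and are skipped by this sketch.
    TYPES = {1: "commit", 2: "tree", 3: "blob", 4: "tag"}

    def salvage(packfile):
        data = open(packfile, "rb").read()
        assert data[:4] == b"PACK", "not a packfile"
        nobj = int.from_bytes(data[8:12], "big")
        pos, recovered = 12, 0
        for _ in range(nobj):
            try:
                # Variable-length object header: 3-bit type, then size.
                byte = data[pos]; pos += 1
                otype, size, shift = (byte >> 4) & 7, byte & 15, 4
                while byte & 0x80:
                    byte = data[pos]; pos += 1
                    size |= (byte & 0x7f) << shift
                    shift += 7
                if otype == 6:                # ofs-delta: varint offset
                    while data[pos] & 0x80:
                        pos += 1
                    pos += 1
                elif otype == 7:              # ref-delta: 20-byte base id
                    pos += 20
                # Decompress to find where this entry ends in the pack.
                z = zlib.decompressobj()
                body = z.decompress(data[pos:])
                if not z.eof:
                    break                     # stream cut off: stop here
                pos = len(data) - len(z.unused_data)
            except (IndexError, zlib.error):
                break                         # truncated or corrupt: stop
            if otype in TYPES and len(body) == size:
                # Write the recovered non-delta object into the local
                # object database.
                subprocess.run(["git", "hash-object", "-w", "--stdin",
                                "-t", TYPES[otype]],
                               input=body, check=True)
                recovered += 1
        print("salvaged %d non-delta objects from %s"
              % (recovered, packfile))

    if __name__ == "__main__":
        salvage(sys.argv[1])

A real tool would also want to resolve the deltas whose bases were
recovered, but the above is enough to show that nothing exotic is
needed.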


[Footnote]

*1* Once the traditional "have"-"ack" exchange determines the set of
    objects the sender thinks the receiver may not have, we need to
    figure out the ones that happen to exist on the receiving end
    already, either because they were salvaged from truncated pack
    data received earlier, or because they already existed thanks to
    fetching from a side branch (e.g. two repositories derived from
    the same upstream, such as somebody who regularly interacts with
    the linux-next tree updating from Linus's kernel tree), and
    exclude them from the set of objects the sender sends.

    I've long felt that Eppstein's invertible Bloom filter might be a
    good way to efficiently determine which of the objects held by
    the sending and the receiving ends are common to both, but I
    haven't looked into this deeply myself.
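
    For what it's worth, here is a toy sketch of the idea (Python,
    illustrative only; the cell count, number of hashes, and checksum
    scheme are arbitrary assumptions, and decode failures or duplicate
    cell indices are glossed over).  Each side summarizes its object
    ids into a small table, the tables are subtracted, and "peeling"
    the difference yields the ids that exist on one side only:

        import hashlib

        K, CELLS = 3, 256          # arbitrary parameters for this toy

        def _positions(oid):
            # Derive K cell indices and a checksum from a 20-byte id.
            h = hashlib.sha1(oid).digest()
            idxs = [int.from_bytes(h[4*i:4*i+4], "big") % CELLS
                    for i in range(K)]
            return idxs, h[12:]

        def encode(oids):
            # Each cell holds (count, xor of ids, xor of checksums).
            table = [[0, bytes(20), bytes(8)] for _ in range(CELLS)]
            for oid in oids:
                idxs, chk = _positions(oid)
                for i in idxs:
                    c = table[i]
                    c[0] += 1
                    c[1] = bytes(a ^ b for a, b in zip(c[1], oid))
                    c[2] = bytes(a ^ b for a, b in zip(c[2], chk))
            return table

        def diff(mine, theirs):
            # Subtract the two tables cell by cell, then peel.
            table = [[a[0] - b[0],
                      bytes(x ^ y for x, y in zip(a[1], b[1])),
                      bytes(x ^ y for x, y in zip(a[2], b[2]))]
                     for a, b in zip(mine, theirs)]
            only_mine, only_theirs = set(), set()
            progress = True
            while progress:
                progress = False
                for cell in table:
                    if cell[0] not in (1, -1):
                        continue
                    oid = cell[1]
                    idxs, chk = _positions(oid)
                    if cell[2] != chk:
                        continue          # not a "pure" cell yet
                    sign = cell[0]
                    (only_mine if sign == 1 else only_theirs).add(oid)
                    for i in idxs:
                        c = table[i]
                        c[0] -= sign
                        c[1] = bytes(a ^ b for a, b in zip(c[1], oid))
                        c[2] = bytes(a ^ b for a, b in zip(c[2], chk))
                    progress = True
            return only_mine, only_theirs

        # mine, theirs = encode(my_oids), encode(their_oids)
        # only_mine, only_theirs = diff(mine, theirs)

    The attraction is that the table size depends on the size of the
    symmetric difference, not on the total number of objects, which
    is exactly the shape of the problem above.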


