Re: Resumable git clone?

Josh Triplett <josh@xxxxxxxxxxxxxxxx> · Wed, 2 Mar 2016 08:41:18 -0800

On Wed, Mar 02, 2016 at 12:31:16AM -0800, Junio C Hamano wrote:
> Josh Triplett <josh@xxxxxxxxxxxxxxxx> writes:
> > I think several simpler optimizations seem
> > preferable, such as binary object names, and abbreviating complete
> > object sets ("I have these commits/trees and everything they need
> > recursively; I also have this stack of random objects.").
> 
> Given the way pack stream is organized (i.e. commits first and then
> trees and blobs that belong to the same delta chain together), and
> our assumed goal being to salvage objects from an interrupted
> transfer of a packfile, you are unlikely to ever see "I have these
> commits/trees and everything they need" that are salvaged from such
> a failed transfer.  So I doubt such an optimization is worth doing.

True for the resumable clone case.  For that optimization, I was
thinking of the "pull during the merge window" case that Al Viro was
also interested in optimizing.

> Besides it is very expensive to compute (the computation is done on
> the client side, so the cycles burned and the time the user has to
> wait is of much less concern, though); you'd essentially be doing
> "git fsck" to find the "dangling" objects.

Trading client-side computation for bandwidth can potentially be
worthwhile if you have plenty of local compute but a slow and metered
link.

> The list of what would be transferred needs to come in full from the
> server end, as the list names objects that the receiving end may not
> have seen, but the response by the client could be encoded much
> tightly.  For the full list of N objects from the server, we can
> think of your response to be a bitstream of N bits, each on-bit in
> which signals an unwanted object in the list.  You can optimize this
> transfer by RLE compressing the bitstream, for example.
> 
> As git-over-HTTP is stateless, however, you cannot assume that the
> server side remembers what it sent to the client (instead, the
> client side needs to re-post what it heard from the server in the
> previous exchange to allow the server side to use it after
> validating).  So "objects at these indices in your list" kind of
> optimization may not work very well in that environment.  I'd
> imagine that an exchange of "Here are the list of objects", "Give me
> these objects" done naively in full 40-hex object names would work
> OK there, though.

Good point.  Between statelessness and Duy's point about the client list
usually being smaller than the server list, perhaps it would make sense
to not have the server send a list at all, and just have the client send
its own list.

- Josh Triplett
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html