Re: resumable git-clone?

Nguyen Thai Ngoc Duy <pclouds@xxxxxxxxx> wrote:
> I was on a crappy connection and it was frustrating seeing git-clone
> reach 80% then fail, then start over again. Can we support
> resumable git-clone at some level? I think we could split into several
> small packs, keep the fetched ones, and just get the missing packs
> until we have them all.

This is, uh, difficult over the native git protocol.  The problem
is that the native protocol negotiates what the client already has
and what it needs by comparing sets of commits.  If the client says
"I have commit X" then the server assumes it has not only commit
X _but also every object reachable from it_.
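To make that concrete, here is roughly what the client side of that
exchange looks like on the wire.  This is not git's actual code; the
helper, the framing wrapper and the object names below are made up
just to illustrate the want/have negotiation:

/*
 * Sketch of the client side of the negotiation.  pkt_write() is an
 * invented helper: each line is prefixed with its total length
 * (including the 4-byte prefix itself) as four hex digits.
 */
#include <stdio.h>
#include <string.h>

static void pkt_write(FILE *out, const char *line)
{
    /* e.g. "0032want <40-hex-sha1>\n" */
    fprintf(out, "%04x%s", (unsigned)(strlen(line) + 4), line);
}

static void pkt_flush(FILE *out)
{
    /* a zero-length packet ("0000") terminates the want list */
    fputs("0000", out);
}

int main(void)
{
    /* object names are fake, one "want" per ref the client asks for */
    pkt_write(stdout, "want 0123456789abcdef0123456789abcdef01234567\n");
    pkt_write(stdout, "want 89abcdef0123456789abcdef0123456789abcdef\n");
    pkt_flush(stdout);

    /*
     * Each "have" asserts that the client has this commit AND every
     * object reachable from it -- which is exactly why a truncated
     * pack is of no use in the next round of negotiation.
     */
    pkt_write(stdout, "have fedcba9876543210fedcba9876543210fedcba98\n");
    pkt_write(stdout, "done\n");
    return 0;
}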

Now packfiles are organized to place commits at the front of the
packfile.  So a truncated download will give the client a whole
host of commits, maybe even all of them, but none of the trees
or blobs associated with them, as those come behind the commits.
Worse, the commits are sorted most recent to least recent.  So if
the client claims he has the very first commit he received, that
is currently an assertion that he has the entire repository.

I have been thinking about this resumable fetch idea for the native
protocol for a few days now, like since the last time it came up
on #git.

One possibility is to have the client store locally, in a temporary
file, the list of wants and the list of haves it sent to the server
during the last fetch.

During a resume of a packfile download we actually just replay this
list of wants/haves, even if the server has newer data.  We also tell
the server which object we last successfully downloaded (its SHA-1).
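Something like the following is the shape of state file I have in
mind.  The path, format and function name are invented just to show
the idea; reading it back on resume would be the mirror image:

/*
 * Hypothetical resume state kept by the client:
 *
 *   want <sha1>    -- one per want sent last time
 *   have <sha1>    -- one per have sent last time
 *   last <sha1>    -- last object fully received
 */
#include <stdio.h>

static int save_resume_state(const char *path,
                             const char **wants, int nr_wants,
                             const char **haves, int nr_haves,
                             const char *last_received)
{
    FILE *f = fopen(path, "w");
    int i;

    if (!f)
        return -1;
    for (i = 0; i < nr_wants; i++)
        fprintf(f, "want %s\n", wants[i]);
    for (i = 0; i < nr_haves; i++)
        fprintf(f, "have %s\n", haves[i]);
    if (last_received)
        fprintf(f, "last %s\n", last_received);
    return fclose(f);
}

int main(void)
{
    const char *wants[] = { "0123456789abcdef0123456789abcdef01234567" };
    const char *haves[] = { "89abcdef0123456789abcdef0123456789abcdef" };

    /* On resume we read this back and replay it verbatim, even if
     * the server's refs have moved on since the first attempt. */
    return save_resume_state(".git/RESUME_FETCH" /* made-up path */,
                             wants, 1, haves, 1,
                             "fedcba9876543210fedcba9876543210fedcba98");
}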

The server would only accept the wants in the resumed list that are
still reachable from its current refs.  Any that aren't are simply
culled from the want list; this way you can still successfully
resume a download of, say, git.git, where pu rebases often.  You
just might not get pu without going back for it.
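The culling step itself is cheap.  As a toy illustration only, with
integer commit ids standing in for real objects (nothing like the
actual object database code):

#include <stdio.h>

#define NR_COMMITS 5

/* parent[i] is the single parent of commit i, or -1 for a root.
 * Commit 4 is a tip no longer reachable from any ref, e.g. an old
 * pu that was rebased away. */
static const int parent[NR_COMMITS] = { -1, 0, 1, 1, -1 };

static int reachable[NR_COMMITS];

static void mark_reachable(int commit)
{
    while (commit >= 0 && !reachable[commit]) {
        reachable[commit] = 1;
        commit = parent[commit];
    }
}

int main(void)
{
    int refs[] = { 2, 3 };      /* server's current ref tips */
    int wants[] = { 2, 3, 4 };  /* wants replayed by the client */
    int i;

    for (i = 0; i < 2; i++)
        mark_reachable(refs[i]);

    for (i = 0; i < 3; i++) {
        if (reachable[wants[i]])
            printf("keep want %d\n", wants[i]);
        else
            printf("cull want %d (no longer reachable)\n", wants[i]);
    }
    return 0;
}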

If the server always performs a very stable (meaning we don't ever
change the sorting order!) and deterministic sorting of the objects
in the packfile, then given the same list of wants/haves and a
"prior" point it can pick up from where it left off.

At worst we are retransmitting one whole object again, e.g. the
client had all but the last byte of the object, so it was no good.
I'm willing to say we do the full object retransmission in case the
object was recompressed on the server between the first fetch and
the second.  It just simplifies the restart.

Probably not that difficult.  The hardest part is committing to the
object sorting order so that when we ask for a restart we *know*
we didn't miss an object.

> I didn't clone via http so I don't know if http supports resuming.

HTTP would have a better chance at doing a resume.  Looking at the
code, it looks like we do in fact resume a packfile download if it
was truncated.
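For what it's worth, a minimal sketch of that kind of resume using
libcurl directly (the http code is built on libcurl anyway).  The URL
and filenames are made up and error handling is stripped down:

#include <stdio.h>
#include <sys/stat.h>
#include <curl/curl.h>

int main(void)
{
    const char *url =
        "http://example.com/repo.git/objects/pack/pack-0123abcd.pack";
    const char *local = "pack-0123abcd.pack.part";
    struct stat st;
    curl_off_t offset = 0;
    FILE *out;
    CURL *curl;
    CURLcode ret;

    /* If a partial download exists, append to it and ask the server
     * for the remaining bytes via a Range request. */
    if (!stat(local, &st))
        offset = st.st_size;

    out = fopen(local, offset ? "ab" : "wb");
    if (!out)
        return 1;

    curl_global_init(CURL_GLOBAL_DEFAULT);
    curl = curl_easy_init();
    if (!curl) {
        fclose(out);
        return 1;
    }
    curl_easy_setopt(curl, CURLOPT_URL, url);
    curl_easy_setopt(curl, CURLOPT_RESUME_FROM_LARGE, offset);
    curl_easy_setopt(curl, CURLOPT_WRITEDATA, out);
    ret = curl_easy_perform(curl);

    curl_easy_cleanup(curl);
    curl_global_cleanup();
    fclose(out);
    return ret != CURLE_OK;
}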

-- 
Shawn.