Re: Continue git clone after interruption

On Thu, 20 Aug 2009, Jakub Narebski wrote:

> On Wed, 19 Aug 2009, Nicolas Pitre wrote:
> > You'll get the very latest revision for HEAD, and only that.  The size 
> > of the transfer will be roughly the size of a daily snapshot, except it 
is fully up to date.  It is however non-resumable in the event of a 
> > network outage.  My proposal is to replace this with a "git archive" 
> > call.  It won't get all branches, but for the purpose of initialising 
> > one's repository that should be good enough.  And the "git archive" can 
> > be fully resumable as I explained.
> 
> It is however only 2.5 MB out of 37 MB that is resumable, which is 7%
> (well, that of course depends on the repository).  Not that much of it
> is resumable.

Take the Linux kernel then.  It is more like 75 MB.

> > Now to deepen that history.  Let's say you want 10 more revisions going 
> > back; then you simply perform the fetch again with --depth=10.  Right 
> > now it doesn't seem to work optimally, but the pack that is then being 
> > sent could be made of deltas against objects found in the commits we 
> > already have.  Currently it seems that a pack that also includes those 
> > objects we already have in addition to those we want is created, which 
> > is IMHO a flaw in the shallow support that shouldn't be too hard to fix.  
> > Each level of deepening should then be as small as standard fetches 
> > going forward when updating the repository with new revisions.
> 
> You would have the same (or at least quite similar) problems with the
> deepening part (the 'incrementals' transfer part) as you found with my
> first proposal of server bisection / division of rev-list, and serving
> 1/Nth of revisions (where N is selected so the packfile is reasonable) to
> the client as incrementals.  Yours is a top-down, mine was a bottom-up
> approach to sending a series of smaller packs.  The problem is how to
> select the size of the incrementals, and that incrementals are
> all-or-nothing (but see also the comment below).

Yes and no.  Combined with a slight reordering of commit objects, it 
could be possible to receive a partial pack and still be able to extract 
a bunch of full revisions.  The biggest cost is transferring revision x 
itself (75 MB for Linux); revision x-1 usually requires only a few 
kilobytes on top of that, revision x-2 another few kilobytes, etc.  
Remember that you are likely to have only a few deltas from one revision 
to the next, which is not the case for the very first revision you get.  
A special mode for pack-objects could place each commit object only 
after all the objects needed to create that revision.  So once you get a 
commit object on the receiving end, you could assume that all objects 
reachable from that commit have already been received, or you had them 
locally already.
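
To illustrate the grouping (not an actual pack-objects mode -- 
pack-objects today doesn't honor stdin order, and the tag range below is 
only an example), something along these lines would list, oldest first, 
the objects introduced by each commit followed by the commit itself:

    # For each commit in the range, print its new trees/blobs first,
    # then the commit object last.
    git rev-list --reverse v2.6.30..v2.6.31 |
    while read c
    do
        git rev-list --objects "$c^!" | grep -v "^$c"
        echo "$c"
    done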

> In the proposal using git-archive and shallow clone deepening as incrementals
> you have this small seed (how small depends on the repository: 50% - 5%)
> which is resumable.  And presumably with deepening you can somehow make
> some use of an incomplete packfile, only part of which was transferred
> before a network error / disconnect.  And even tell the server about
> objects which you managed to extract from *.pack.part.

Yes.  And at that point resuming the transfer is just another case of 
shallow repository deepening.
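
In terms of commands that exist today, the flow would look roughly like 
this (the repository URL is only an example, and the resumable initial 
transfer is of course the part that doesn't exist yet):

    git clone --depth=1 git://git.kernel.org/pub/scm/git/git.git
    cd git
    git fetch --depth=10       # deepen: go back a few more revisions
    git fetch --depth=100      # deepen further, until you have enough history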

> *NEW IDEA*
> 
> Another solution would be to try to come up with some sort of stable
> sorting of objects so that packfile generated for the same parameters
> (endpoints) would be always byte-for-byte the same.  But that might be
> difficult, or even impossible.

And I don't want to commit to that either.  Having some flexibility in 
object ordering makes it possible to improve on the packing heuristics.  
We certainly should avoid imposing strong restrictions like that for 
little gain.  Even the deltas are likely to be different from one 
request to another when using threads, as one thread might get more CPU 
time than another, slightly modifying the outcome.
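
That non-determinism should be easy to observe on any repository (a 
rough check; the pack file name may stay the same since it is derived 
from the sorted object list, but the bytes may well differ between 
runs):

    git config pack.threads 4     # make the multi-threaded delta search explicit
    git repack -adf && sha1sum .git/objects/pack/pack-*.pack
    git repack -adf && sha1sum .git/objects/pack/pack-*.pack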

> Well, we could send the client the list of objects in the pack, in the
> order used later by pack creation (a non-resumable but small part), and if
> the packfile transport was interrupted in the middle the client would
> compare the list of complete objects in the partial packfile against this
> manifest, and send a request to the server with a *sorted* list of objects
> it doesn't have yet.

Well... actually that's one of the items for pack v4.  Lots of SHA1s are 
duplicated in tree and commit objects, in addition to the pack index 
file.  With pack v4 all those SHA1s would be stored only once in a table 
and objects would index that table instead.

Still, that is not _that_ small.  Just look at the size of the 
pack index file for the Linux repository to get an idea.
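
For example, on any full clone (the exact figures obviously depend on 
the repository):

    git count-objects -v                  # number of packed objects
    du -h .git/objects/pack/pack-*.idx    # roughly 28 bytes per object with index v2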

> The server would probably have to check the validity of the object list
> first (the object list might need to be more than just a list of objects;
> it might need to specify the topology of deltas, i.e. which objects are
> the base for which ones).  Then it would generate the rest of the packfile.

I'm afraid that has the looks of something adding lots of complexity to 
a piece of git that is already quite complex, namely pack-objects.  And 
there are only a few individuals with their brains around it.

> > > It would be useful if it were possible to generate part of this rock-solid
> > > file for a partial (range, resume) request, without needing to generate
> > > (calculate) parts that the client already downloaded.  Otherwise the server
> > > has to either waste disk space and IO on caching, or waste CPU (and IO)
> > > on generating a part which is not needed and dropping it to /dev/null.
> > > git-archive, you say, has this feature.
> > 
> > "Could easily have" is more appropriate.
> 
> O.K.  And I can see how this can be easy done.
> 
> > > Next you need to tell the server that you have those objects obtained via
> > > the resumable download part ("git archive HEAD" in your proposal), and
> > > that it can use them and not include them in the prepared file/pack.
> > > "have" is limited to commits, and "have <sha1>" tells the server that
> > > you have <sha1> and all its prerequisites (dependencies).  You can't 
> > > use "have <sha1>" with the git-archive solution.  I don't know enough
> > > about the 'shallow' capability (and what it enables) to know whether
> > > it can be used for that.  Can you elaborate?
> > 
> > See above, or Documentation/technical/shallow.txt.
>  
> Documentation/technical/shallow.txt doesn't cover the "shallow", "unshallow"
> and "deepen" commands from the 'shallow' capability extension to the git
> pack protocol (http://git-scm.com/gitserver.txt).

404 Not Found

Maybe that should be committed to git in Documentation/technical/  as 
well?

> > > Then you have to finish the clone / fetch.  All solutions so far include
> > > some kind of incremental improvement.  My first proposal of bisect
> > > fetching 1/Nth or a predefined-size pack is a bottom-up solution, where
> > > we build the full clone from the root commits up.  You propose, from what
> > > I understand, building the full clone from the top commit down, using
> > > deepening from a shallow clone.  In this step you either get the full
> > > incremental or not; downloading an incremental (from what I understand)
> > > is not resumable / does not support partial fetch.
> > 
> > Right.  However, like I said, the incremental part should be much 
> > smaller and therefore less susceptible to network troubles.
> 
> If the resumable git-archive part is 7% of the total pack size, how small
> do you plan to make those incremental deepenings?  Besides, in my 1/Nth
> proposal those bottom-up packs were also meant to be sufficiently small
> to avoid network troubles.

Two issues here: 1) people with slow links might not be interested in a 
deep history as it costs them time.  2) Extra revisions should typically 
require only a few KB each, so we could simply ask for the full history 
once the initial revision is downloaded and salvage as much as we can if 
a network outage is encountered.  There is no need for an arbitrary 
increment size, unless the user arbitrarily decides to get only 10 more 
revisions, or 100 more, etc.
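
A crude way to check the "few KB per revision" figure on an existing 
full clone (the ratio is repository-dependent, and the total includes 
the initial snapshot, so this over-estimates the incremental cost):

    git rev-list HEAD | wc -l      # number of revisions
    du -sh .git/objects/pack       # total packed size; divide one by the other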

> P.S. As you can see implementing resumable clone isn't easy...

I've been saying that for quite a while now.   ;-)


Nicolas