Re: Continue git clone after interruption

Jakub Narebski <jnareb@xxxxxxxxx> · Fri, 21 Aug 2009 23:41:30 +0200

On Fri, 21 Aug 2009, Nicolas Pitre wrote:
> On Fri, 21 Aug 2009, Jakub Narebski wrote:
>> On Thu, 20 Aug 2009, Nicolas Pitre wrote:
>>> On Thu, 20 Aug 2009, Jakub Narebski wrote:

>>>> It is however only 2.5 MB out of 37 MB that are resumable, which is 7%
>>>> (well, that of course depends on repository).  Not that much that is
>>>> resumable.
>>> 
>>> Take the Linux kernel then.  It is more like 75 MB.
>> 
>> Ah... good example.
>> 
>> On the other hand Linux is a fairly large project in terms of LoC, but
>> it had its history cut when moving to Git, so the ratio of git-archive
>> of HEAD to the size of packfile is overemphasized here.
> 
> That doesn't matter.  You still need that amount of data up front to do 
> anything.  And I doubt people with slow links will want the full history 
> anyway, regardless if it goes backward 4 years or 18 years back.

On the other hand, an unreliable link doesn't have to mean an
unreasonably slow link.

Hopefully GitTorrent / git-mirror-sync will finally come out of
vapourware and won't share the fate of Duke Nukem Forever ;-),
and we will have it as an alternative way to clone large repositories.
Well, supposedly there is some code, and last year's GSoC project at
least shook the dust off the initial design and made it simpler, IIUC.

>> You make use here of a few facts:
[...]

>> 2. There is support in git pack format to do 'deepening' of shallow
>>    clone, which means that git can generate incrementals in top-down
>>    order, _similar to how objects are ordered in packfile_.
> 
> Well... the pack format was not meant for that "support".  The fact that 
> the typical object order used by pack-objects when serving fetch request 
> is amenable to incremental top-down updates is rather coincidental and 
> not really planned.

Oops.  I meant the "git pack PROTOCOL" here, not the "git pack _format_":
the one about the want/have/shallow/deepen exchange.
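
To make it concrete which exchange I mean, here is a rough sketch (in
Python, not real git code) of how a client request with "want",
"shallow" and "deepen" lines could be framed as pkt-lines.  The function
names are made up for illustration, and the capability list that a real
client appends to its first "want" line is left out.

```python
FLUSH_PKT = b"0000"  # flush-pkt: marks the end of the request section


def pkt_line(payload: str) -> bytes:
    """Frame one line: 4 hex digits of total length (including those 4 bytes), then the payload."""
    data = payload.encode("ascii")
    return b"%04x" % (len(data) + 4) + data


def deepen_request(wants, shallows, depth):
    """Build a request body: want lines, then shallow lines, then one deepen line.

    Hypothetical helper; capabilities on the first want line are omitted.
    """
    lines = [pkt_line(f"want {oid}\n") for oid in wants]
    lines += [pkt_line(f"shallow {oid}\n") for oid in shallows]
    lines.append(pkt_line(f"deepen {depth}\n"))
    lines.append(FLUSH_PKT)
    return b"".join(lines)


if __name__ == "__main__":
    tip = "a" * 40       # placeholder object names, not real commits
    boundary = "b" * 40
    print(deepen_request([tip], [boundary], 10))
```

The "have" negotiation would follow this part of the exchange; it is not
shown here.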

[...]
>>> A special 
>>> mode to pack-object could place commit objects only after all the 
>>> objects needed to create that revision.  So once you get a commit object 
>>> on the receiving end, you could assume that all objects reachable from 
>>> that commit are already received, or you had them locally already.
>> 
>> Yes, with such mode (which I think wouldn't reduce / interfere with
>> ability for upload-pack to pack more tightly by reordering objects
>> and choosing different deltas) it would be easy to do a salvage of
>> a partially completed / transferred packfile.  Even if there is no
>> extension to tell git server which objects we have ("have" is only
>> about commits), if there is at least one commit object in received
>> part of packfile, we can try to continue from later (from more);
>> there is less left to download.
> 
> Exact.  Suffice to set the last received commit(s) (after validation) as 
> one of the shallow points.

Assuming that the received commit is full (has all its prerequisites),
and is connected to the rest of the body of the partially [shallow]
cloned repository.
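
As a toy illustration of the "full commit" check (a sketch only: object
names are plain strings, and the `prerequisites` mapping stands in for
what would really be found by walking the commit's tree), one could
select the candidate shallow points from an interrupted transfer like
this.  It does not check the connectivity condition.

```python
def complete_commits(received, prerequisites, local=()):
    """Return received commits whose prerequisite objects are all available.

    received:      object ids salvaged from the partial pack, in arrival order
    prerequisites: maps each commit id to the ids of the trees/blobs it needs
                   (hypothetical stand-in for walking the commit's tree)
    local:         object ids we already had before the transfer
    """
    have = set(local) | set(received)
    return [
        oid
        for oid in received
        if oid in prerequisites and set(prerequisites[oid]) <= have
    ]


if __name__ == "__main__":
    needs = {"c2": ["t2", "b1", "b2"], "c1": ["t1", "b1"]}
    # Transfer was cut off after t1, before commit c1 arrived.
    partial = ["t2", "b1", "b2", "c2", "t1"]
    print(complete_commits(partial, needs))
```

Only commits that pass this check would be usable as shallow points when
retrying.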

>>>> Documentation/technical/shallow.txt doesn't cover "shallow", "unshallow"
>>>> and "deepen" commands from 'shallow' capability extension to git pack
>>>> protocol (http://git-scm.com/gitserver.txt).
>>> 
>>> 404 Not Found
>>> 
>>> Maybe that should be committed to git in Documentation/technical/  as 
>>> well?
>> 
>> This was plain text RFC for the Git Packfile Protocol, generated from
>> rfc2629 XML sources at http://github.com/schacon/gitserver-rfc
> 
> I suggest you track it down and prod/propose a version for merging in 
> the git repository.

Scott Chacon was (and is) CC-ed.

I don't know if you remember the earlier discussion about the pack
protocol, stemming from the fact that some git (re)implementations
(Dulwich, JGit) failed to implement it properly, where properly = same
as git-core, i.e. the original implementation in C... because there was
not enough documentation.

>>>> P.S. As you can see implementing resumable clone isn't easy...
>>> 
>>> I've been saying that all along for quite a while now.   ;-)
>> 
>> Well, on the other hand we have the example of how long it took to
>> arrive at the current implementation of git submodules.  But it finally
>> got done.
> 
> In this case there is still no new line of code whatsoever.  Thinking 
> it through is what takes time.

Measure twice, cut once :-)

In this case I think designing up front is a good approach.

>> The git-archive + deepening approach you proposed can be split into
>> smaller individual improvements.  You don't need to implement it all
>> at once.
[...]

>> 3. Create new git-archive pseudoformat, used to transfer single commit
>>    (with commit object and original branch name in some extended header,
>>    similar to how commit ID is stored in extended pax header or ZIP
>>    comment).  It would imply not using export-* gitattributes.
> 
> The format I was envisioning is really simple:
> 
> First the size of the raw commit object data content in decimal, 
> followed by a 0 byte, followed by the actual content of the commit 
> object, followed by a 0 byte.  (Note: this could be the exact same 
> content as the canonical commit object data with the "commit" prefix, 
> but as all the rest are all blob content this would be redundant.)
> 
> Then, for each file:
> 
>  - The file mode in octal notation just as in tree objects
>  - a space
>  - the size of the file in decimal
>  - a tab
>  - the full path of the file
>  - a 0 byte
>  - the file content as found in the corresponding blob
>  - a 0 byte
> 
> And finally some kind of marker to indicate the end of the stream.
> 
> Put the lot through zlib and you're done.

So you don't want to just tack the commit object (as an extended pax
header, or a comment - if that is at all possible) onto the existing
'tar' and 'zip' archive formats.  It is probably better to design the
format from scratch.
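
Just to check my understanding of the layout described above, here is a
minimal sketch of a writer and a reader in Python.  The function names
are invented, and the end-of-stream marker (a single 0 byte, which
cannot be confused with a file entry because entries start with an octal
digit) is my assumption; the description only says "some kind of
marker".

```python
import zlib

END_MARKER = b"\0"  # assumption: the end-of-stream marker was left unspecified


def encode_snapshot(commit: bytes, files) -> bytes:
    """files: iterable of (mode, path, content), e.g. (0o100644, "README", b"...")."""
    out = bytearray()
    # Commit: decimal size, NUL, raw commit data, NUL.
    out += b"%d\0" % len(commit) + commit + b"\0"
    for mode, path, content in files:
        # Entry: octal mode, space, decimal size, tab, path, NUL, content, NUL.
        out += b"%o %d\t" % (mode, len(content))
        out += path.encode("utf-8") + b"\0" + content + b"\0"
    out += END_MARKER
    return zlib.compress(bytes(out))


def _expect_nul(raw: bytes, pos: int) -> int:
    if raw[pos:pos + 1] != b"\0":
        raise ValueError("truncated or corrupt stream")
    return pos + 1


def decode_snapshot(data: bytes):
    """Inverse of encode_snapshot; returns (commit, [(mode, path, content), ...])."""
    raw = zlib.decompress(data)
    nul = raw.index(b"\0")
    size = int(raw[:nul])
    commit = raw[nul + 1:nul + 1 + size]
    pos = _expect_nul(raw, nul + 1 + size)
    files = []
    while raw[pos:pos + 1] != END_MARKER:
        sp = raw.index(b" ", pos)      # end of mode
        tab = raw.index(b"\t", sp)     # end of size
        nul = raw.index(b"\0", tab)    # end of path
        mode = int(raw[pos:sp], 8)
        size = int(raw[sp + 1:tab])
        path = raw[tab + 1:nul].decode("utf-8")
        content = raw[nul + 1:nul + 1 + size]
        pos = _expect_nul(raw, nul + 1 + size)
        files.append((mode, path, content))
    return commit, files
```

Because every piece carries its size up front, file content may contain
0 bytes without confusing the reader; the trailing 0 byte serves only as
a consistency check.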

>> 4. Implement alternate ordering of objects in packfile, so commit object
>>    is put immediately after all its prerequisites.
> 
> That would require some changes in the object enumeration code which is 
> an area of the code I don't know well.

Oh.
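
For what it is worth, the ordering I have in mind can be sketched on a
toy model (Python, nothing to do with the actual object enumeration
code; `objects_for` is a made-up mapping from a commit to the trees and
blobs it needs):

```python
def order_commit_last(commits, objects_for):
    """Emit each commit right after the objects it needs that were not sent yet.

    commits:     commit ids in the order they should be served (newest first)
    objects_for: hypothetical mapping commit id -> list of tree/blob ids it needs
    """
    seen = set()
    ordered = []
    for commit in commits:
        for oid in objects_for[commit]:
            if oid not in seen:      # send every object only once
                seen.add(oid)
                ordered.append(oid)
        ordered.append(commit)       # the commit comes after its prerequisites
    return ordered


if __name__ == "__main__":
    needs = {"c2": ["t2", "b1", "b2"], "c1": ["t1", "b1"]}
    print(order_commit_last(["c2", "c1"], needs))
    # prints ['t2', 'b1', 'b2', 'c2', 't1', 'c1']
```

With such an order, any commit found in a truncated stream already has
everything it needs in front of it.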

-- 
Jakub Narebski
Poland