On Wed, 19 Aug 2009, Johannes Schindelin wrote:

> Hi,
>
> On Tue, 18 Aug 2009, Nicolas Pitre wrote:
>
> > On Wed, 19 Aug 2009, Johannes Schindelin wrote:
> >
> > > But seriously, I miss a very important idea in this discussion: we
> > > control the Git source code. So we _can_ add a upload_pack feature
> > > that a client can ask for after the first failed attempt.
> >
> > Indeed. So what do you think about my proposal? It was included in
> > my first reply to this thread.
>
> Did you not talk about an extension of the archive protocol? That's
> not what I meant. The archive protocol can be disabled for completely
> different reasons than to prevent restartable clones.

And those reasons are?

> But you brought up an important point: shallow repositories.
>
> Now, the problem, of course, is that if you cannot even get a single
> ref (shallow'ed to depth 0 -- which reminds me: I think I promised to
> fix that, but I did not do that yet) due to intermittent network
> failures, you are borked, as you said.

Exactly.

> But here comes an idea: together with Nguyễn's sparse series, it is
> conceivable that we support a shallow & narrow clone via the
> upload-pack protocol (also making mithro happy). The problem with
> narrow clones was not the pack generation side, that is done by a
> rev-list that can be limited to certain paths. The problem was that
> we end up with missing tree objects. However, if we can make a sparse
> checkout, we can avoid the problem.

Sure, if you can salvage as much as you can from a partial pack and
create a shallow and narrow clone out of it, then it should be possible
to do some kind of restartable clone.

I still think this might be much less complex to achieve through
git-archive, especially if some files (i.e. objects) are large enough
to be exposed to network outages. It is much the same issue as being
able to fetch at least one revision, only to a lesser degree. You might
be able to get that first revision through multiple attempts by
gathering the missing objects on each attempt. But if you encounter a
large enough object, you might be unlucky enough never to transfer it
completely before the next network failure. With a simple extension to
git-archive, any object's content could be resumed as many times as
needed from any offset.

Then, deepening the history should make use of deltas through the pack
protocol, which should hopefully consist of much smaller transfers and
therefore be less prone to network outages.

That could be sketched like this, supposing the user runs
"git clone git://foo.bar/baz" (a rough script follows the list):

1) "git init baz" etc. as usual.

2) "git ls-remote git://foo.bar/baz HEAD" and store the result in
   .git/CLONE_HEAD so as not to be confused by the remote HEAD possibly
   changing before we're done.

3) "git archive --remote=git://foo.bar/baz CLONE_HEAD" and store the
   result locally. Keep track of how many files have been received, and
   how many bytes of the file currently being received.

4) If the network connection is broken, loop back to (3), adding
   --skip=${nr_files_received},${nr_bytes_in_curr_file_received} to the
   git-archive argument list. The remote server simply skips over the
   specified number of files, and then that many bytes into the next
   file.

5) Get the content of the remote commit object for CLONE_HEAD
   somehow. (?)

6) "git add . && git write-tree" and make sure the top tree SHA1
   matches the one in the commit from (5).

7) "git hash-object -w -t commit" with the data obtained in (5), and
   make sure the result matches the SHA1 from CLONE_HEAD.

8) Update the local HEAD with CLONE_HEAD and set it up as a shallow
   clone. Delete .git/CLONE_HEAD.

9) Run "git fetch" with the --depth parameter to get more revisions.
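To make that concrete, here is a rough, untested shell sketch of the
same sequence. Keep in mind that the --skip=<files>,<bytes> option to
git-archive is only being proposed here and does not exist yet, step
(5) remains the open question noted above, the shallow setup in (8) is
hand-waved by writing .git/shallow directly, and the depth value in (9)
is arbitrary:

#!/bin/sh
# Rough sketch only: the --skip option to git-archive is hypothetical,
# and steps (5) to (7) cannot be completed until we know how to obtain
# the remote commit object.

url=git://foo.bar/baz

# (1) create the repository as usual
git init baz && cd baz || exit 1

# (2) remember which commit we are cloning, so that a remote HEAD
# changing under us cannot confuse a later resume
test -f .git/CLONE_HEAD ||
    git ls-remote "$url" HEAD | cut -f1 >.git/CLONE_HEAD || exit 1
head=$(cat .git/CLONE_HEAD)

# (3)+(4) fetch the tree content via git-archive, restarting with the
# proposed --skip=<nr_files>,<nr_bytes> option until it completes
skip=
until git archive --remote="$url" ${skip:+--skip=$skip} "$head" |
      tar xf -
do
    # count the files fully received, and the bytes of the partial one
    nr_files=$(( $(git ls-files -o | wc -l) - 1 ))
    nr_bytes=$(wc -c <"$(git ls-files -o | tail -1)")
    skip=$nr_files,$nr_bytes
done

# (5) get the content of the remote commit object for $head somehow...
#commit_data=???

# (6)+(7) recreate the tree and the commit locally and verify them
git add .
tree=$(git write-tree)
# ... compare $tree with the tree recorded in $commit_data, then feed
# $commit_data to "git hash-object -w -t commit --stdin" and check the
# result against $head.

# (8) point HEAD at the cloned commit and mark the repository shallow
git update-ref HEAD "$head"
echo "$head" >.git/shallow
rm .git/CLONE_HEAD

# (9) deepen through the pack protocol (the depth value is arbitrary)
git fetch --depth=10 "$url"

The resume bookkeeping is the same ls-files trick mentioned in the
notes below.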
9) Run "git fetch" with the --depth parameter to get more revisions. Notes: - This mode of operation should probably be optional, like by using --safe or --restartable with 'git clone'. And since this mode of operation is really meant for people with slow and unreliable network connections, they're unlikely to wish for the whole history to be fetched. Hence this mode could simply be triggered by the --depth parameter to 'git clone' which would provide a clear depth value to use in (9). - If the transfer is interrupted locally with ^C then it should be possible to resume it by noticing the presence of .git/CLONE_HEAD up front. DEtermining how many files to skip when resuming with git-archive can be done with $((`git ls-files -o | wc -l` - 1)) and $(git ls-files -o | tail -1 | wc -c). - That probably would be a good idea to have a tgz format to 'git archive' which might be simpler to deal with than the zip format. - Step (3) could be optimized in many ways, like by directly using hash-object and update-index, or by using a filter to pipe the result directly into fast-import. - So to say that the above should be pretty easy to implement even with a shell script. A builtin version could then be made if this proves to actually be useful. And the server remains stateless with no additional caching needed which would go against any attempt at making a busy server like git.kernel.org share as much of the object store between plenty of mostly identical repositoryes. > Note: this is not well thought-through, but just a brainstorm-like answer > to your ideas. And so is the above. Nicolas