Duy Nguyen <pclouds@xxxxxxxxx> writes:

> Resumable clone is happening. See [1] for the basic idea, [2] and [3]
> for some preparation work. I'm sure you can help. Once you've gone
> through at least [1], I think you can pick something (e.g. finalizing
> the protocol, update the server side, or git-clone....)
>
> [1] http://thread.gmane.org/gmane.comp.version-control.git/285921
> [2] http://thread.gmane.org/gmane.comp.version-control.git/288080/focus=288150
> [3] http://thread.gmane.org/gmane.comp.version-control.git/288205/focus=288222

I think your response needs to be refined with a somewhat
higher-level overview, though.  Here are some thoughts to summarize
the discussion and to extend it.

I think the right way to think about this is that we are adding a
capability for the server to instruct the clients:

    I prefer not to serve a full clone to you in the usual route if
    I can avoid it.  You can help me by going to an alternate
    resource, populating your history first, and then coming back to
    me for an additional fetch to complete the history if you want
    to.  Doing so would also help you because that alternate
    resource can be a static file (or two) that you can download
    over a resumable transport (like static files served over
    HTTPS).

That alternate resource could be just an old-style bundle file
(e.g. kernel.org prepares such a bundle file for Linus's kernel
repository and makes it available on CDN on a weekly basis;
cf. https://kernel.org/cloning-linux-from-a-bundle.html).  One
downside of using the old-style bundle is that it would weigh about
the same as the fully repacked bare repository itself, and would
require the same amount of CPU and disk resources to generate as it
would take to repack.  The "split bundle" discussion with Jeff King
is about one possible way to reduce that waste.
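To make the bundle discussion concrete, here is a rough sketch of
reading the header of an old-style (v2) bundle.  The refs it lists are
the tips of history a client would learn from such a file.  The
function name read_bundle_header() and the sample object name are made
up for illustration; this is not how git itself implements it.

```python
import io

BUNDLE_V2_SIGNATURE = b"# v2 git bundle\n"


def read_bundle_header(stream):
    """Parse a v2 bundle header from a binary stream.

    Returns (prerequisites, refs, pack_offset): the object names the
    receiver must already have, a {refname: object name} map of the
    tips, and the byte offset where the pack data begins.
    """
    if stream.readline() != BUNDLE_V2_SIGNATURE:
        raise ValueError("not a v2 git bundle")
    prerequisites, refs = [], {}
    while True:
        line = stream.readline()
        if line == b"":
            raise ValueError("bundle header ended before the blank line")
        if line == b"\n":  # a blank line separates header from pack data
            break
        text = line.decode("utf-8").rstrip("\n")
        if text.startswith("-"):
            # "-<object name> [comment]": a commit the bundle depends on
            prerequisites.append(text[1:].split(" ", 1)[0])
        else:
            # "<object name> <refname>": a tip contained in the bundle
            oid, name = text.split(" ", 1)
            refs[name] = oid
    return prerequisites, refs, stream.tell()


if __name__ == "__main__":
    # Made-up object name and truncated pack bytes, for illustration only.
    sample = (
        BUNDLE_V2_SIGNATURE
        + b"1" * 40 + b" refs/heads/master\n"
        + b"\n"
        + b"PACK\x00\x00\x00\x02"
    )
    prereqs, tips, offset = read_bundle_header(io.BytesIO(sample))
    print(prereqs, tips, sample[offset:offset + 4])
```

A bundle meant to seed a full clone would have no prerequisite lines,
so everything after the header is usable as a self-contained pack.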
The old-style bundle is just a header tucked in front of a packfile,
and by introducing a new bundle format that stores only the header
part in a file that points at an existing packfile, we can reduce the
waste.  A few patches from me on "bundle" and "index-pack
--clone-bundle" sent over the past several days are about that
approach.  During the periodic repack that server operators run, we
can also create the header part of the new bundle format that points
at the full packfile that is produced in order to serve the regular
"fetch/push" traffic.

My response to [3] in the thread further points at a new direction.
The "alternate resource" does not have to be a bundle, but can be
just a full packfile (i.e. pack-$name.pack).  After a full repack,
the server operators can make the packfile available to clients over
a resumable transport.  The client has to run "index-pack" on the
downloaded pack-$name.pack to generate the "pack-$name.idx" file in
order to make it usable, so the logic to implement "--clone-bundle"
introduced initially for the "split bundle" approach can be
repurposed to be run on the client.

With a single pack-$name.pack file, the client can

 - Place it in .git/objects/pack in an empty repository;

 - Generate the corresponding pack-$name.idx file next to it;

 - Learn where the tips of histories (i.e. "all objects that are
   reachable from these objects are already available in this
   repository") are.

And the above is sufficient to do the "coming back to me for an
additional fetch" efficiently.  The tips of histories can be sent as
extra "have" records during such a fetch with a minor update to the
"fetch" code.

So what remains?  Here is a rough and still slushy outline:

 - A new method, prime_clone(), in "struct transport" for "git clone"
   client to first call to learn the location of the "alternate
   resource" from the server.

   - The server side endpoint does not have to be, and I think it
     should not be, implemented as an extension to the current
     upload-pack protocol.  It is perfectly fine to add a new "git
     prime-clone" program next to existing "git upload-pack" and
     "git receive-pack" programs and drive it through the
     git-daemon, curl remote helper, and direct execution over ssh.

   - The format of the returned "answer" needs to be designed.  It
     must be able to express:

     - the location of the resource, i.e. a URL;

     - the type of resource, if we want this to be extensible.  I
       think we should initially limit it to "a single full history
       .pack", so from that point of view this may not be absolutely
       necessary, but we already know that we may want to say "go
       there and you will find an old-style bundle file" to support
       the kernel.org CDN, and we may also want to support Jeff's
       "split bundle" or Shawn's ".info" file.  A resource-poor
       (read: personal) machine that hosts a personal fork of a
       popular project might want to name a "git clone" URL for
       that popular project it forked from (e.g. "Clone Linus's
       repository from kernel.org and then come back here for
       incremental fetch").

 - A new method, download_primer(), in "struct transport" to download
   the "alternate resource" learned from an earlier call to the
   prime_clone() method.

   - I expect that a typical answer to the "prime-clone" (see above)
     request would be an HTTP(S) URL for a single pack file, or the
     header file of a "split bundle" pair.  The curl remote helper
     would implement this as an equivalent of "wget -c" of these
     files.

   - The answer to a "prime-clone" request could name a "git://"
     URL, and it is OK to design a new server-side endpoint to
     respond to the download_primer() method, aka "a minimal
     resumable download service via git-daemon" Duy mentioned in
     [1].

 - To support the case where the server responds with "a single full
   .pack file" to prime_clone():

   - "index-pack" needs to be extended to compute the tips of the
     history contained within (the logic is already done, but if the
     "split bundle" output is not the best one to use, we may need
     to update the output format);

   - The way the "tips of the history" are told to "fetch-pack" that
     does the final incremental fetch to the original site needs to
     be designed.  It could be a "throw-away temporary sub-hierarchy
     somewhere in refs/" I alluded to in [2], but there may be
     better designs (e.g. naming a split bundle as "--reference" to
     a fresh invocation of the "git clone" command).

 - Update "git clone" (builtin/clone.c).

   - Refactor it to make it easier to wedge in the new code (below),
     if necessary.

   - Teach it a new "--resume" option.

     - When the command is run with this option, no other option
       may be given.

     - If the existing (half cloned) repository is not marked as
       resumable, the command must fail.

     - The first few steps below are skipped (i.e. we do not create
       worktree and gitdir, we do not do init_db(), we do not write
       refspec configuration).

   - Let the original code run up to the point where it creates
     worktree and gitdir, does init_db(), and writes the refspec
     configuration.

   - If some options are given that make it less efficient to do the
     "prime from an alternate resource and then fetch", do not do
     anything special and let the original code run to the end
     (i.e. no resuming).  This may include cases where "--reference"
     is given, indicating that the bulk of objects are already
     available locally.  The details of this determination need to
     be worked out.

   - If we are doing the "prime from an alternate and then fetch",
     before the current code calls transport_fetch_refs(), call a
     new transport API function, transport_prime_clone(), to learn
     if the server wants us to prime from an "alternate resource".
     If not, let the original code run to the end.

   - If we are still doing the "prime from an alternate and then
     fetch", mark the repository as resumable, and call
     transport_download_primer(), a new transport API function.
     This will implement the "retry until the whole thing is
     successfully downloaded" and "continue from where its earlier
     incarnation was killed" behaviors.

   - Once transport_download_primer() finishes downloading the
     "alternate resource", prime the object store and record the
     tips of the history.

   - Perform an incremental "fetch" against the original repository.

   - Finalize the clone, e.g. point our HEAD to a correct branch,
     start our 'master' (or whatever the primary branch name is)
     branch at a correct place in the history and check the files
     out to the working tree.


[1] http://thread.gmane.org/gmane.comp.version-control.git/288080/focus=288161
[2] http://thread.gmane.org/gmane.comp.version-control.git/288205/focus=288222
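
As a postscript, the "retry until the whole thing is downloaded" and
"continue from where its earlier incarnation was killed" behavior
expected of transport_download_primer() could be sketched roughly as
below.  This is only an illustration under loud assumptions: the
FlakySource class and the Python download_primer() function are
invented stand-ins, not proposed API, and the sketch ignores the
possibility of the remote resource changing between attempts.

```python
import os
import tempfile


class FlakySource:
    """Hypothetical stand-in for a static file on a resumable transport.

    Each read_from() call serves at most `burst` bytes and then "drops
    the connection", so the caller has to come back with a new offset.
    """

    def __init__(self, payload, burst):
        self.payload = payload
        self.burst = burst

    def read_from(self, offset):
        chunk = self.payload[offset:offset + self.burst]
        complete = offset + len(chunk) >= len(self.payload)
        return chunk, complete


def download_primer(source, dest_path, max_attempts=10):
    """Append to dest_path until the resource is complete.

    The size of the partial file on disk is the resume offset, so a
    later invocation picks up where a killed one left off.
    Returns the number of attempts used.
    """
    for attempt in range(1, max_attempts + 1):
        offset = os.path.getsize(dest_path) if os.path.exists(dest_path) else 0
        chunk, complete = source.read_from(offset)
        with open(dest_path, "ab") as out:
            out.write(chunk)
        if complete:
            return attempt
    raise RuntimeError("primer download did not complete")


if __name__ == "__main__":
    payload = b"PACK" + bytes(range(256)) * 4
    source = FlakySource(payload, burst=100)
    with tempfile.TemporaryDirectory() as tmp:
        dest = os.path.join(tmp, "primer.pack")
        try:
            download_primer(source, dest, max_attempts=2)  # "killed" early
        except RuntimeError:
            pass
        download_primer(source, dest, max_attempts=20)     # resumed run
        with open(dest, "rb") as f:
            print(f.read() == payload)
```

The point of keeping all state in the partial file itself is that a
"git clone --resume" run needs nothing beyond the half-cloned
repository to know where to continue, which is the same idea as
"wget -c".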