On Fri, Mar 11, 2011 at 06:10, Alexander Miseler <alexander@xxxxxxxxxx> wrote:
> On 11.03.2011 14:48, Nguyen Thai Ngoc Duy wrote:
>>> On Fri, Mar 11, 2011 at 01:18:45PM +0100, Alexander Miseler wrote:
>>>>
>>>> Resumable clone
>>
>> A simpler way to a restartable clone is to facilitate bundles (Nicolas'
>> idea). Some glue is needed to teach git-fetch/git-daemon to use the
>> bundles, and git-push to automatically create bundles periodically (or
>> a new command that can be run from cron). I think this way fits into
>> the GSoC scope better.

I think the cached bundle idea is horrifically stupid in the face of
the subsequent cached pack idea. JGit already implements cached packs,
and it works very well. The feature just needs to be back-ported to
builtin/pack-objects.c, along with some minor edits to my RFC patch to
git-repack.sh to be able to construct the cached pack.

Unlike a cached bundle, the cached pack doesn't eat up useless disk
space on the server. It's still the only copy of the object content,
which keeps server disk usage (and buffer cache usage) lower.

A protocol extension in the fetch-pack/upload-pack protocol is required
to allow pack-objects to delimit the early thin-pack from the later
cached pack, as well as to supply the cached pack's identity. A client
who breaks the connection after the leading thin-pack has been received
could then restart by downloading the cached pack from a specific
starting byte.

Even without waiting for pack v4, cached packs can shave a full minute
of server CPU time off a clone of the linux-2.6 kernel. That's nothing
to laugh at; these days a full CPU minute is a lot of computational
work.

It is also pretty backwards compatible with the current network
protocol: even ancient Git clients can still use the cached pack during
an initial clone, saving a lot of server resources. With cached packs,
organizations like Gentoo wouldn't need to implement bizarre hacks in
their upload-pack binary to prevent clones over git:// from their
servers.

It is also well within GSoC size scope. I think the hard part is
understanding enough of how the revision walker works inside of
pack-objects in order to construct the leading thin-pack.

>> [1] The idea of my work above was mentioned elsewhere: history is cut
>> down by path. Each file/dir's history is a very long chain of deltas.
>> We can stream deltas (in parallel if needed) over the wire, resuming
>> where the chain stopped last time.
>
> This may all be aiming too short. IMHO the best solution would be some
> generic way for the client to specify exactly what it wants to get and
> to get just that. This would lay the groundwork for:
> - lazy clones
> - sparse clones
> - resumable cloning
> - resumable fetching

Junio and I would like to see the narrow checkout code re-implemented
to support obtaining only a subset of the paths from the remote. Once
that is implemented, a client on a really bad network connection could
do a resumable clone by grabbing a shallow clone of depth 1 along no
paths, partitioning the root tree up, then extending its paths to grab
subdirectories until the root commit is fully expanded. Then it can
walk back, increasing its depth until it runs into the cached pack...
where it can then do byte range requests.

This won't be pretty. And given that the leading thin-pack for a cached
pack can be less than 2% of the entire data transfer, it may not be
necessary for a resumable clone.
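To make the trade-offs concrete, here are rough sketches of the three
approaches discussed above. Everything in them is illustrative only:
the URLs, paths and pack names are made up, and the cached-pack
protocol keywords do not exist in any implementation.

The bundle route quoted at the top needs nothing new on the wire; a
cron job on the server publishes a bundle over plain HTTP, and the
client resumes the download with ordinary byte ranges before cloning
from the local file:

  # server: cron job that periodically snapshots all refs into a bundle
  cd /srv/git/project.git &&
  git bundle create /var/www/snapshots/project.bundle --all

  # client: resumable download, clone from the bundle, then catch up
  wget -c http://example.com/snapshots/project.bundle
  git clone project.bundle project
  cd project
  git remote set-url origin git://example.com/project.git
  git fetch origin

The cost is exactly the objection above: the bundle is a second, full
copy of the object content sitting next to the repository.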
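The cached-pack resume itself needs the protocol extension described
above, which does not exist yet; the capability and request names in
the comments below are invented purely for illustration. Over dumb HTTP
the equivalent already works today, because a pack is just a file on
disk:

  # hypothetical smart-protocol flow (keywords made up, nothing implements this):
  #   S: ... multi_ack side-band-64k thin-pack cached-pack ...
  #   C: want <tip sha1> ...                 # normal negotiation
  #   S: <leading thin-pack>, then <cached pack id + size>, then its bytes
  #   C: (connection drops; reconnect)
  #   C: resume cached-pack <id> offset=<bytes already received>

  # dumb-HTTP analogue that works today: the pack name comes from
  # objects/info/packs, and an interrupted download resumes via a range request
  curl -C - -O http://example.com/project.git/objects/pack/pack-<sha-1>.pack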
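The shallow-walk fallback can mostly be sketched with commands that
exist today; the narrow, per-path half still has to be written, so only
the depth dimension is shown here:

  # client on a very bad link: start from one commit, then deepen in
  # small, individually restartable steps
  git clone --depth=1 git://example.com/project.git
  cd project
  for d in 2 4 8 16 32 64 128; do
      git fetch --depth=$d origin   # each step transfers only a little more history
  done

Each deepening fetch is small, so losing the connection only ever costs
the current step, not the whole transfer.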
IMHO if you cannot get 2% of the data transfer before your connection
breaks, maybe you should ask for the data on DVD via post, because your
network sucks.

--
Shawn.