Hi, 2010/7/27 Shawn O. Pearce <spearce@xxxxxxxxxxx>: > I would prefer doing something more like what we do with shallow > on the client side. Record in a magic file the path(s) that we > did actually obtain. During fsck, rev-list, or read-tree the > client skips over any paths that don't match that file's listing. > Then we can keep the same commit SHA-1s, but we won't complain that > there are objects missing. I recently decided to take a crack at implementing sparse clones, due to a crazy idea I had (which might not be as crazy as I thought since you suggest something similar, though more limited). I was going to wait until I actually got somewhere tangible with it to post an RFC, particularly since it may take me a while, but since it's fresh on everyone's minds perhaps now is good anyway. Does the following seem sane, or are there big gotchas that I'm just unaware of? 0) Sparse clones have "all" commit objects, but not all trees/blobs. Note that "all" only means all that are reachable from the refs being downloaded, of course. I think this is widely agreed upon and has been suggested many times on this list. 1) A user controls sparseness by passing rev-list arguments to clone. This allows a user to control sparseness both in terms of span of content (files/directories) and depth of history. It can also be used to limit to a subset of refs (cloning just one or two branches instead of all branches and tags). For example, $ git clone ssh://repo.git dst -- Documentation/ $ git clone ssh://repo.git dst master~6..master $ git clone ssh://repo.git dst -3 (Note that the destination argument becomes mandatory for those doing a sparse clone in order to disambiguate it from rev-list options.) This method also means users don't need much training to learn how to use sparse clones -- they just use syntax they've already learned with log, and clone will pass this info on to upload-pack. There is a slight question as to whether users should have to specify "--all HEAD" with all sparse clones or whether it should be assumed when no other refs are listed. 2) Sparse checkouts are automatically invoked with the path(s) from the specified rev-list arguments. Can't checkout content that we don't have. :-) This has a slight downside -- it makes sparse checkouts and sparse clones slight misfits: the syntax (.gitignore style vs. rev-list arguments) is a bit different, and sparse checkouts can exclude certain paths whereas my sparse clones would only be able to *include* paths. I don't see this as a deal-breaker, but even if others disagree I think a more general path-exclusion mechanism for the revision walking machinery would be really nice for reasons beyond just this one. I've often wanted to do something like git log -S'important code phrase' --EXCLUDE-PATH=big-data-dir 3) The limiting rev-list arguments passed to clone are stored. However, relative arguments such as "-3" or "master~6" first need to be translated into one or more exclude ranges written as "^<sha1>". 4) All revision-walking operations automatically use these limiting args. This should be a simple code change, and would enable rev-list, log, etc. to avoid missing blobs/trees and thus enable them to work with sparse clones. fsck would take a bit more work, since it doesn't use the setup_revisions() and revision.h walking machinery, but shouldn't be too bad (I hope). There are also performance ramifications: There should be no measurable performance overhead for non-sparse clones (something that might be a problem with a different implementation that did does-this-exist check each time it references a blob). It should also be a significant performance boost for those using it, as operations will only need to deal with the subset of the repository they specify (faster downloads, stats, logs, etc.) 5) "Densifying" a sparse clone can be done One can fetch a new pack and replace the limiting rev-list args with the new choice. The sparse checkout information needs to be updated too. (So users probably would want to densify a sparse clone with "pull" rather than "fetch", as manually updating sparse checkouts may be a bit of a hassle.) 6) Cloning-from/fetching-from/pushing-to sparse clones is supported. Future fetches and pushes also make use of the limiting arguments. Receives do as well, but only to make sure the pack obtained is not "more sparse" than what the receiving repository already has. (uploads ignore the stored rev-list arguments, instead using the rev-list arguments passed to it -- it will die if asked for content not locally available to it.) 7) Operations that need unavailable data simply error out Examples: merge, cherry-pick, rebase (and upload-pack in a sparse clone). However, hopefully the error messages state what extra information needs to be downloaded so the user can appropriately "densify" their repository. Thanks, Elijah -- To unsubscribe from this list: send the line "unsubscribe git" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html