Sparse clones (Was: Re: [PATCH 1/2] upload-pack: support subtree packing)

Elijah Newren <newren@xxxxxxxxx> · Tue, 27 Jul 2010 18:13:09 -0600

Hi,

2010/7/27 Shawn O. Pearce <spearce@xxxxxxxxxxx>:
> I would prefer doing something more like what we do with shallow
> on the client side.  Record in a magic file the path(s) that we
> did actually obtain.  During fsck, rev-list, or read-tree the
> client skips over any paths that don't match that file's listing.
> Then we can keep the same commit SHA-1s, but we won't complain that
> there are objects missing.

I recently decided to take a crack at implementing sparse clones, due
to a crazy idea I had (which might not be as crazy as I thought since
you suggest something similar, though more limited).  I was going to
wait until I actually got somewhere tangible with it to post an RFC,
particularly since it may take me a while, but since it's fresh on
everyone's minds perhaps now is good anyway.

Does the following seem sane, or are there big gotchas that I'm just unaware of?

0) Sparse clones have "all" commit objects, but not all trees/blobs.

Note that "all" only means all that are reachable from the refs being
downloaded, of course.  I think this is widely agreed upon and has
been suggested many times on this list.

1) A user controls sparseness by passing rev-list arguments to clone.

This allows a user to control sparseness both in terms of span of
content (files/directories) and depth of history.  It can also be used
to limit to a subset of refs (cloning just one or two branches instead
of all branches and tags).  For example,
  $ git clone ssh://repo.git dst -- Documentation/
  $ git clone ssh://repo.git dst master~6..master
  $ git clone ssh://repo.git dst -3
(Note that the destination argument becomes mandatory for those doing
a sparse clone in order to disambiguate it from rev-list options.)

This method also means users don't need much training to learn how to
use sparse clones -- they just use syntax they've already learned with
log, and clone will pass this info on to upload-pack.

There is a slight question as to whether users should have to specify
"--all HEAD" with all sparse clones or whether it should be assumed
when no other refs are listed.
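
To make the argument handling in (1) concrete, here is a minimal
Python sketch of how clone might split its command line.  The names
parse_sparse_clone and SparseCloneArgs are hypothetical, not existing
git code.  The sketch assumes paths are only given after "--"; real
rev-list parsing also guesses whether a bare word is a revision or a
path.

```python
from typing import NamedTuple


class SparseCloneArgs(NamedTuple):
    repo: str
    dst: str
    options: list[str]    # e.g. ["-3"]
    revisions: list[str]  # e.g. ["master~6..master"]
    paths: list[str]      # e.g. ["Documentation/"]


def parse_sparse_clone(argv: list[str]) -> SparseCloneArgs:
    """Split 'repo dst [rev-list args] [-- paths]' into its parts."""
    if len(argv) < 2:
        # The destination is mandatory so it cannot be mistaken
        # for a rev-list argument.
        raise ValueError("sparse clone needs a repository and a destination")
    repo, dst, rest = argv[0], argv[1], argv[2:]
    if "--" in rest:
        split = rest.index("--")
        limiting, paths = rest[:split], rest[split + 1:]
    else:
        limiting, paths = rest, []
    # Assumption: anything starting with "-" is an option, the rest are revisions.
    options = [arg for arg in limiting if arg.startswith("-")]
    revisions = [arg for arg in limiting if not arg.startswith("-")]
    return SparseCloneArgs(repo, dst, options, revisions, paths)
```

Whether an empty revision list should default to "--all HEAD" is left
out on purpose, since that question is still open.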

2) Sparse checkouts are automatically invoked with the path(s) from
   the specified rev-list arguments.

We can't check out content that we don't have.  :-)

This has a slight downside: sparse checkouts and sparse clones don't
quite match.  The syntax (.gitignore style vs. rev-list
arguments) is a bit different, and sparse checkouts can exclude
certain paths whereas my sparse clones would only be able to *include*
paths.  I don't see this as a deal-breaker, but even if others
disagree I think a more general path-exclusion mechanism for the
revision walking machinery would be really nice for reasons beyond
just this one.  I've often wanted to do something like
  git log -S'important code phrase' --EXCLUDE-PATH=big-data-dir

3) The limiting rev-list arguments passed to clone are stored.

However, relative arguments such as "-3" or "master~6" first need to
be translated into one or more exclude ranges written as "^<sha1>".
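
As a rough illustration of that translation, the following Python
sketch works on a toy history (a dict mapping each commit id to its
list of parents) rather than on real git objects.  resolve_tilde and
freeze_count_limit are hypothetical helper names.  The breadth-first
walk is an assumption standing in for rev-list's real ordering, and
the result is only exact for linear history; with merges it is an
approximation.

```python
from collections import deque


def resolve_tilde(parents, refs, spec):
    """Resolve 'name~N' by following first parents N times from the ref."""
    name, sep, count = spec.partition("~")
    steps = int(count or 1) if sep else 0
    commit = refs[name]
    for _ in range(steps):
        commit = parents[commit][0]
    return commit


def freeze_count_limit(parents, tip, n):
    """Return "^<id>" excludes that stand in for a '-n' limit from tip."""
    kept, queue, seen = [], deque([tip]), {tip}
    while queue and len(kept) < n:
        commit = queue.popleft()
        kept.append(commit)
        for parent in parents[commit]:
            if parent not in seen:
                seen.add(parent)
                queue.append(parent)
    kept_set = set(kept)
    # Boundary commits: parents of kept commits that were not kept themselves.
    boundary = sorted({p for c in kept for p in parents[c] if p not in kept_set})
    return ["^" + b for b in boundary]


# Toy linear history c0 <- c1 <- ... <- c7, with master at c7.
parents = {"c0": []}
parents.update({f"c{i}": [f"c{i - 1}"] for i in range(1, 8)})
refs = {"master": "c7"}

stored_for_range = ["^" + resolve_tilde(parents, refs, "master~6")]
stored_for_count = freeze_count_limit(parents, refs["master"], 3)
```

The point of storing "^<id>" rather than "master~6" or "-3" is that
the stored limit keeps meaning the same thing after master moves.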

4) All revision-walking operations automatically use these limiting args.

This should be a simple code change, and would enable rev-list, log,
etc. to avoid missing blobs/trees and thus enable them to work with
sparse clones.  fsck would take a bit more work, since it doesn't use
the setup_revisions() and revision.h walking machinery, but shouldn't
be too bad (I hope).

There are also performance ramifications: there should be no
measurable performance overhead for non-sparse clones (something that
might be a problem with a different implementation that did a
does-this-exist check each time it referenced a blob).  It should also
be a significant performance boost for those using sparse clones, as
operations will only need to deal with the subset of the repository
they specify (faster downloads, stats, logs, etc.)
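
The idea of pruning by path instead of probing for each object can be
sketched in Python on a toy tree (nested dicts for trees, strings for
blobs).  Everything here is hypothetical illustration, not git's
tree-walking code: MISSING marks an object that was never downloaded,
and walk only looks at an entry after the stored paths say it is
wanted.

```python
MISSING = object()  # stands in for a tree or blob that was never downloaded


def _norm(path):
    return path.strip("/")


def is_included(path, includes):
    """True if path is one of the included paths or lies beneath one."""
    path = _norm(path)
    return any(path == inc or path.startswith(inc + "/")
               for inc in map(_norm, includes))


def leads_to_included(dir_path, includes):
    """True if some included path lies beneath this directory."""
    prefix = _norm(dir_path) + "/"
    return any(_norm(inc).startswith(prefix) for inc in includes)


def walk(tree, includes, base=""):
    """Yield blob paths in the sparse set without touching skipped entries."""
    for name, entry in sorted(tree.items()):
        path = base + name
        wanted = is_included(path, includes)
        if not wanted and not leads_to_included(path, includes):
            continue  # outside the sparse set: skip without looking at it
        if entry is MISSING:
            raise LookupError(f"object for {path!r} should be present but is not")
        if isinstance(entry, dict):
            yield from walk(entry, includes, path + "/")
        elif wanted:
            yield path


root = {
    "Documentation": {"git.txt": "blob1", "technical": {"pack.txt": "blob2"}},
    "big-data-dir": MISSING,
    "Makefile": MISSING,
}
docs = list(walk(root, ["Documentation/"]))
```

With no stored paths configured, a real implementation would simply
skip the filter, which is why non-sparse clones should see no extra
cost.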

5) "Densifying" a sparse clone can be done

One can fetch a new pack and replace the limiting rev-list args with
the new choice.  The sparse checkout information needs to be updated
too.

(So users probably would want to densify a sparse clone with "pull"
rather than "fetch", as manually updating sparse checkouts may be a
bit of a hassle.)

6) Cloning-from/fetching-from/pushing-to sparse clones is supported.

Future fetches and pushes also make use of the limiting arguments.
Receives do as well, but only to make sure the pack obtained is not
"more sparse" than what the receiving repository already has.
(Uploads ignore the stored rev-list arguments and instead use the
rev-list arguments passed to them; an upload will die if asked for
content not locally available to it.)
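
A possible shape for the "not more sparse" check on the receiving
side, as a Python sketch.  It only compares path limits and ignores
history depth, which a real check would also need to cover.  The
function names are hypothetical, and treating an empty path list as
"no path limit" is an assumption of this sketch.

```python
def _norm(path):
    return path.strip("/")


def path_covers(outer, inner):
    """True if everything under inner is also under outer."""
    outer, inner = _norm(outer), _norm(inner)
    return outer == "" or inner == outer or inner.startswith(outer + "/")


def not_more_sparse(incoming_paths, local_paths):
    """True if the incoming pack covers every path the receiver already has."""
    if not incoming_paths:
        return True   # incoming has no path limit, so it covers everything
    if not local_paths:
        return False  # receiver is full-width but incoming is path-limited
    return all(any(path_covers(inc, loc) for inc in incoming_paths)
               for loc in local_paths)
```

For example, a repository limited to Documentation/technical/ could
accept a pack limited to Documentation/, but not the other way around.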

7) Operations that need unavailable data simply error out

Examples: merge, cherry-pick, rebase (and upload-pack in a sparse
clone).  However, hopefully the error messages state what extra
information needs to be downloaded so the user can appropriately
"densify" their repository.

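One way such an error could carry the information a user needs is
sketched below in Python.  SparseDataMissing and require_paths are
hypothetical names, and the message wording is only a placeholder;
the sketch checks paths only, not history depth.

```python
class SparseDataMissing(Exception):
    """Raised when an operation needs paths outside the sparse clone."""

    def __init__(self, operation, missing_paths):
        self.operation = operation
        self.missing_paths = sorted(missing_paths)
        super().__init__(
            f"{operation}: needs paths outside the sparse clone: "
            + ", ".join(self.missing_paths)
        )


def _covered(path, includes):
    path = path.strip("/")
    return any(path == inc.strip("/") or path.startswith(inc.strip("/") + "/")
               for inc in includes)


def require_paths(operation, needed, includes):
    """Error out before the operation starts if any needed path is absent."""
    missing = [p for p in needed if not _covered(p, includes)]
    if missing:
        raise SparseDataMissing(operation, missing)
```

Checking up front, before a merge or rebase has touched anything,
keeps the repository in a clean state when the operation has to be
refused.
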
Thanks,
Elijah