[RFC PATCH 01/15] README-sparse-clone: Add a basic writeup of my ideas for sparse clones

Elijah Newren <newren@xxxxxxxxx> · Sat, 4 Sep 2010 18:13:53 -0600

This write-up just has basic ideas, strategies, notes of what needs to be
done, etc.  It needs to be pruned, cleaned up, corrected as I learn more,
moved elsewhere, etc.

Signed-off-by: Elijah Newren <newren@xxxxxxxxx>
---
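Note (not part of the patch): item I4 in the README below says fake
commits are stored in a file using lines of the form
"<commit> <tree> <parent1> ...".  The following is only a rough sketch of
how such a file might be read, assuming whitespace-separated fields and
one record per line.  The names FakeCommit and parse_fake_commits are
hypothetical, and a real implementation would live in git's C code.

```python
from typing import Dict, List, NamedTuple


class FakeCommit(NamedTuple):
    """Hypothetical in-memory form of one "missing" commit record."""
    commit: str
    tree: str
    parents: List[str]


def parse_fake_commits(text: str) -> Dict[str, FakeCommit]:
    """Parse lines of "<commit> <tree> <parent1> ..." into a lookup table.

    Assumes whitespace-separated fields; blank lines are skipped.
    A record with no parents (a root commit) is accepted.
    """
    table: Dict[str, FakeCommit] = {}
    for lineno, line in enumerate(text.splitlines(), start=1):
        fields = line.split()
        if not fields:
            continue
        if len(fields) < 2:
            raise ValueError(f"line {lineno}: expected at least <commit> <tree>")
        commit, tree, *parents = fields
        table[commit] = FakeCommit(commit, tree, parents)
    return table
```

Keying the table by commit id matches the "only referenced when git
would otherwise die" note: the lookup would happen at the point where
loading a commit object fails.
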
 README-sparse-clone |  283 +++++++++++++++++++++++++++++++++++++++++++++++++++
 1 files changed, 283 insertions(+), 0 deletions(-)
 create mode 100644 README-sparse-clone
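
Note (not part of the patch): items U1 and U6 in the README below
describe how the limiting arguments given to clone split into inclusive
revisions, exclusive ones, and paths after '--'.  A minimal sketch of
that split, under my own simplifying assumptions: only "^rev", "-N" and
two-dot ranges are recognized, and nothing checks that an inclusive
revision really is a branch or tag name (that would need the remote's
ref list).  SparseLimits and split_limits are hypothetical names.

```python
import re
from typing import List, NamedTuple


class SparseLimits(NamedTuple):
    inclusive: List[str]   # branch/tag names to clone
    exclusive: List[str]   # "^rev" or "-N" style limits
    paths: List[str]       # everything after "--"


def split_limits(args: List[str]) -> SparseLimits:
    """Split clone's trailing arguments as described in U1/U6."""
    if "--" in args:
        cut = args.index("--")
        revs, paths = args[:cut], args[cut + 1:]
    else:
        revs, paths = list(args), []

    inclusive: List[str] = []
    exclusive: List[str] = []
    for arg in revs:
        if arg.startswith("^") or re.fullmatch(r"-\d+", arg):
            exclusive.append(arg)
        elif arg.startswith("-") or "..." in arg:
            # Other options and symmetric differences are outside this sketch.
            raise ValueError(f"unsupported limiting argument: {arg}")
        elif ".." in arg:
            # "A..B" is shorthand for "^A B"; an empty side means HEAD.
            left, right = arg.split("..", 1)
            exclusive.append("^" + (left or "HEAD"))
            inclusive.append(right or "HEAD")
        else:
            inclusive.append(arg)

    if not inclusive:
        # The README says "HEAD --all" is assumed in this case.
        inclusive = ["HEAD", "--all"]
    return SparseLimits(inclusive, exclusive, paths)
```

With the three example invocations from U1, this yields paths only,
one exclude plus one branch, and a depth limit, respectively.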

diff --git a/README-sparse-clone b/README-sparse-clone
new file mode 100644
index 0000000..cfeeef3
--- /dev/null
+++ b/README-sparse-clone
@@ -0,0 +1,283 @@
+This is my set of notes on implementing sparse clones, which I define
+as a clone where not all blob, tree, or commit objects are downloaded.
+This includes sparseness both relative to span of directories and
+depth of history.
+
+(Note: This project has work-in-progress patches -- no promises about
+quality or speed of implementation, and the patches may be rebased.)
+
+*** Summary ***
+
+  Basic Idea:
+    0) Only relevant blobs, trees, and commits (+ ancestry) are downloaded.
+  User View:
+    U1) A user controls sparseness by passing rev-list arguments to clone.
+    U2) "Densifying" a sparse clone can be done (with new rev-list arguments)
+    U3) Cloning-from/fetching-from/pushing-to sparse clones is supported.
+    U4) Operations that need unavailable data simply error out
+    U5) Old style shallow clones (--depth argument to clone) are obsolete
+    U6) Miscellaneous notes
+  Internals:
+    I1) The limiting rev-list arguments passed to clone are stored.
+    I2) All revision-walking operations automatically use the limiting args.
+    I3) The index only contains paths matching the sparse limits
+    I4) Loading a missing commit results in a fake commit being created
+    I5) In sparse clones, a special merge strategy must be used
+    I6) Miscellaneous notes
+
+*** Basic Idea ***
+
+0) Only relevant blobs, trees, and commits (+ ancestry) are downloaded.
+
+Only the relevant blobs, trees, and commits are downloaded.
+Irrelevant blobs and trees are left out entirely (see items I2 & I3
+for how we avoid accessing these).
+
+To ensure minimum necessary connectivity, we also download basic
+information from otherwise excluded commits:
+  * parents of these commits
+  * trees matching the specified sparse path(s)
+but, for security and space reasons, do not download:
+  * author
+  * author date
+  * committer
+  * committer date
+  * log message
+Such commits are still considered "missing" (see item I4 for more
+details about how we handle "missing" commits).
+
+Tags/branches are downloaded if specified (or, if no branch/tag is
+specified, all tags/branches are downloaded).
+
+Security note: No modifications are done to existing trees, meaning
+that sparse clones will download the names of "irrelevant" blobs/trees
+along with their types, modes, and sha1sums if (and only if) they
+are siblings of a relevant blob/tree.  It is assumed that such
+information is okay to be transmitted and need not remain private; if
+such information does need to remain private, an alternate mechanism
+involving rewriting commits will be necessary (such as git-subtree).
+
+*** User View ***
+
+U1) A user controls sparseness by passing rev-list arguments to clone.
+
+This allows a user to control sparseness both in terms of span of
+content (files/directories) and depth of history.  It can also be used
+to limit to a subset of refs (cloning just one or two branches instead
+of all branches and tags).  For example,
+  $ git clone ssh://repo.git dst -- Documentation/
+  $ git clone ssh://repo.git dst master~6..master
+  $ git clone ssh://repo.git dst -3
+(Note that the destination argument becomes mandatory for those doing
+a sparse clone in order to disambiguate it from rev-list options.)
+
+This method also means users don't need much training to learn how to
+use sparse clones -- they just use syntax they've already learned with
+log, and clone will pass this info on to upload-pack.
+
+Inclusive revision specifications (master, master~6, v4.15.6) are
+handled differently from exclusive ones (-3, ^master, ^master~6).
+Inclusive revisions must be branch or tag names (e.g. stable or v1.8,
+but not master~6 or v4.18.2~1 or sha1sum or :/<search string>)[1].
+"HEAD --all" is assumed if no inclusive revisions are specified.
+(Note: Avery seems to suggest always assuming "HEAD --all", at least
+at first.)
+
+[1] This limitation on inclusive revisions could be relaxed in the
+future for specifications derived from branch names, as long as each
+branch has no more than one associated derived revision specification.
+For example, master~6 would mean to clone a copy of the master branch
+on the remote side, excluding the last 6 commits, so that you start
+out "6 commits behind" the remote.  Obviously, it wouldn't make sense
+to have both "master^1" and "master^2" specified, since we then
+wouldn't know where master should point in the clone.
+
+U2) "Densifying" a sparse clone can be done (with new rev-list arguments)
+
+One can fetch a new pack, replace the original limiting rev-list args
+with the new choice (see item I1), and update the working copy to
+reflect the changes.  As users wouldn't expect a "fetch" or a "merge"
+to un-sparsify a checkout, a dedicated command (densify) performs all
+three steps.
+
+[First cut will be to just redownload everything, instead of just the
+necessary data.  I'm thinking it won't be a common operation, and it
+could always be improved later.]
+
+U3) Cloning-from/fetching-from/pushing-to sparse clones is supported.
+
+This allows people who need to operate on a subset of the repository
+(e.g. translators, technical writers, etc.) to collaborate on that
+subset.  I think one simple rule should enable this:
+
+  * The receiving repository specifies the limiting rev-list arguments
+    to use (if the sending repository does not have the relevant data,
+    it will naturally error out)
+
+Having the receiving side specify the limiting rev-list arguments
+ensures that any data it receives fulfills its needs.  The sending
+side then uses this information when creating a pack to determine the
+necessary objects to send, ignoring anything outside the paths/ranges
+specified in those limits.  If the sending side is a sparse clone that
+does not have the necessary data specified by the receiver, then
+pack-objects will hit a nasty low-level missing object error, aborting
+the operation.  In the future, we could maybe add a nicer error
+message.
+
+One special case:
+  * When cloning a repository, if the user did not specify any
+    limiting rev-list arguments, use those from the repository being
+    cloned.  (Don't require the user to type out all the paths every
+    time; e.g. 'git clone URL DEST -- PATH1 PATH2 PATH3 PATH4...')
+
+U4) Operations that need unavailable data simply error out
+
+Although no normal git command should be disabled entirely, there will
+be cases when some git commands cannot function without more data.
+
+Examples:
+  * merge, cherry-pick, rebase (if unavailable files needed)
+  * upload-pack (if more data requested than available in a sparse clone)
+
+Merge, cherry-pick, and rebase need special handling to operate in
+sparse clones (see item I5), since merge strategies normally require
+full trees.
+
+U5) Old style shallow clones (--depth argument to clone) are obsolete
+
+Since one can pass "-3" to get a "shallow" clone, old-style shallow
+clones are obsolete.  New style shallow/sparse clones will also be
+more capable, since one can
+  * exclude based on commit (e.g. ^master~10) in addition to depth
+  * clone/push/pull from/to shallow clones
+
+What to do with old style shallow clones?  Probably deprecate them,
+make the --depth argument to clone print an error message suggesting
+the new syntax, and then gut the related code at some point in the
+future.
+
+U6) Miscellaneous notes
+  * fsck & status should print a notice when working on a sparse clone
+  * paths in limiting rev-list args *must* follow '--' (current or
+    future remote repo may be bare, meaning setup_revisions will
+    complain about nonexistent paths specified without a preceding
+    '--').  Having all paths follow a '--' will also make it easier to
+    find them and pass them on to diff machinery (see item I2).
+  * notes hierarchy may also need to be made sparse, so that only
+    notes pointing to downloaded objects are downloaded.  This
+    implies missing blobs/trees, and maybe even "missing" commits.
+    But how do I avoid traversing the wrong notes on the client side?
+    Ouch.  Maybe just include all notes?  Or exclude all notes?
+
+*** Internals ***
+
+I1) The limiting rev-list arguments passed to clone are stored.
+
+However, relative arguments such as "-3" or "^master~6" first need to
+be translated into one or more exclude ranges written as "^<sha1>".
+
+I2) All revision-walking operations automatically use the limiting args.
+
+This should be a simple code change, and would enable rev-list, log,
+diff (which also uses the revision walking machinery), etc. to avoid
+missing blobs/trees/commits and thus enable them to work with sparse
+clones.  fsck would take a bit more work, since it doesn't use the
+setup_revisions() and revision.h walking machinery, but shouldn't be
+too bad (I hope).
+
+Also, the pathspecs (or the diff options they generate) are available
+easily for operations that need them (see I3).
+
+I3) The index only contains paths matching the sparse limits
+
+Since not all trees are downloaded, not all files can even be
+referenced in the index.  Further, in some cases, the only thing that
+can be referenced is a tree rather than a file.  We only want paths
+matching the relevant sparse limits to be included in the index.  This
+means two things:
+  * When extracting entries from trees into the index, the sparse limits
+    need to be taken into consideration
+  * Whenever writing trees, using the index is no longer sufficient.
+    Instead, the files in the index are used to record
+    sha1sums/modes/filenames for paths within the sparse limits, and
+    another tree (typically from HEAD) is used to record
+    sha1sums/modes/filenames/types for paths outside the sparse
+    limits.
+
+Note that writing trees from the index can occur with commit, merge,
+checkout (-m), revert/cherry-pick --no-commit, and write-tree.  All
+need to be updated to either provide a relevant tree or error out when
+run from a sparse clone.
+
+I4) Loading a missing commit results in a fake commit being created
+
+Fake commits have correct parentage and an appropriate (sparse) tree
+(since those pieces of information are available), but blank author &
+committer, 0 for times & timezones, and a commit log message such as
+the following:
+  This commit is missing from this sparse clone.  You can use the
+  densify command to download missing commits and files.
+
+This allows the following to work:
+  * git commit (which needs tree/file sha1sums that were not modified,
+    though if a given tree is unmodified, no subtree/subfile sha1s are
+    needed)
+  * tags & branches (which can correctly point at missing commits)
+  * git show (with a branch/tag/commit)
+  * git prune (missing objects correctly reference their parent(s))
+  * git fsck (missing commits still referenced)
+
+Extra notes:
+  * Stored in a file, one fake commit per line: <commit> <tree> <parent1> ...
+  * Only consulted when git would otherwise die
+
+I5) In sparse clones, a special merge strategy must be used
+
+Most merge strategies work at the file/content level.  Since many
+files and even whole trees will be unavailable, a special strategy
+that works with tree-level items is necessary.  It should only perform
+trivial merges when forced to operate at the tree-level (modified on
+at most one side of history, and probably no rename handling at least
+at first).  When such trivial merges are not possible, it should fail
+with a helpful error message noting the needed tree contents.
+
+For non-missing blobs, standard merge strategies may be used.
+
+I6) Miscellaneous notes
+  * thin-packs: git pack-objects needs to be told to only delta
+    against objects that match the sparse limits, otherwise the
+    receiving side will not be able to use the resulting pack.
+
+----------------------------------------------------------------------
+
+Testcases needed:
+  * basics:        checkout, status, diff, log (w/ options!), add, commit
+  * extras:        blame, apply, bisect, branch, tag, grep, reset
+  * maintenance:   fsck, prune, gc/repack, verify-pack
+  * plumbing:      {read,write,ls,commit,merge,tar,diff}-tree, mktree
+  * direct:        cat-file, show (esp. missing obj. or tag/branch of such)
+  * merge strat.:  merge, cherry-pick/revert, rebase
+  * communication: pull, push, fetch, clone, bundle, archive
+  * protocols:     http, ssh, git, rsync
+  * rewrite:       filter-branch, fast-{export, import}
+  * notes:         ?
+
+  General:
+    'clone NON-BARE-REPO dst PATHS' should fail (needs double dash)!
+    git rev-list master should show subset of available commits
+  Keep Index sparse:
+    git add <path> for <path> not in git_sparse_pathspec should error out
+    update-index on <path> not in git_sparse_pathspec should error out
+  Sparse Index Handling:
+    merge into branch yet to be born, revert
+    checkout -m  (to real branch, from valid or yet-to-be born branch)
+
+  Major TODOs:
+    * fetch
+    * push
+    * don't pass rev-list arguments on command line to upload-pack; use protocol
+    * densify command
+    * missing commits
+    * fix thin packs to only delta against objects within sparse limits
+    * lots more testcases
+    * cleanup FIXMEs
-- 
1.7.2.2.140.gd06af
