Hello.
I intend to work on a "subrepository" tool for git, but before I
embark on the actual programming, I thought to first invite comments
on the general design.
Some background first. I know that there are several existing
approaches already for managing nested repositories, but none of them
quite seems to fit my purposes. My primary goal is to use git for home
directory backup and mirroring, while the home directory itself may of
course contain repositories.
Git-subtree doesn't quite fit the bill. It allows merging a subtree
into a larger tree and then again splitting it out for exporting, but
this is tedious. More importantly, a merged tree gets branched along
with the containing tree, whereas I want to have subrepositories
precisely because the subtrees need to be branched independently of
the container.
Submodules are a bit closer to what I want, but they have clearly been
designed for a different purpose: a repository with submodules is only
supposed to collate existing repositories, not act as a source for
them. So they aren't really faithful to the distributed nature of git:
there's no easy way to completely clone a repository and its submodules.
Moreover, submodules have some other annoyances like not supporting
bare repositories and checking out the submodules with detached HEADs.
Now, in other circumstances I might just patch git-submodule to add
the features I want, but it turns out that it is written in shell. I
know that is a git tradition, but I'm going to get a bit religious
here: anything longer than a screenful shouldn't be written in shell,
and I'm certainly not going to add more lines to an already overlong
script. Hence I'm going to write a separate tool using something a bit
more... structured. Probably Python with Dulwich.
So here are some preliminary thoughts on how the tool should work.
* Repository layout
Every subrepository has a unique identifier. The heads of
subrepository <subname> are simply stored as heads in a subdirectory
of the main repository: e.g.
refs/heads/subrepos/<subname>/<branchname>. Likewise for tags:
refs/tags/subrepos/<subname>/<tagname>.
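As a rough illustration of this naming scheme, here is a minimal Python sketch. The function names are hypothetical, and it assumes that <subname> contains no slash, which the design above does not yet state.

```python
def subrepo_ref(kind, subname, name):
    """Return the ref in the containing repository for a subrepo branch or tag.

    kind is "heads" or "tags". The function name is hypothetical.
    """
    if kind not in ("heads", "tags"):
        raise ValueError("kind must be 'heads' or 'tags'")
    return f"refs/{kind}/subrepos/{subname}/{name}"


def split_subrepo_ref(ref):
    """Inverse of subrepo_ref: return (kind, subname, name), or None.

    Assumption: <subname> contains no slash, so the fourth path component
    is the whole subrepo name and the remainder is the branch or tag name.
    """
    parts = ref.split("/", 4)
    if len(parts) != 5:
        return None
    if parts[0] != "refs" or parts[1] not in ("heads", "tags"):
        return None
    if parts[2] != "subrepos":
        return None
    return parts[1], parts[3], parts[4]
```

Ordinary heads of the containing repository, such as refs/heads/master, are not subrepo refs and yield None.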
Rationale: if we had fully independent repositories under the main
repository directory, like what git-submodule uses, there would be no
easy way to enumerate all the existing subrepositories to copy them.
Since the only things we can list directly from a remote repository
are references, it makes sense to store the subrepositories as nothing
more than a set of references.
The reason for storing the subrepo references under refs/heads/ and
refs/tags/ (instead of, say, refs/subrepos/) is simply that this way
everything is directly compatible with standard git tools: one can do
a normal git clone/push/pull for mirroring and backup purposes without
any need for special tools. You only need tools once you operate on a
working tree.
* Tree layout
A tree can mount references of subrepositories. There are two
components to a mount: a gitlink under <path> to a particular commit
of a subrepo, and an entry in .gitrepos. This is very similar to how
git-submodule works.
The entry in .gitrepos specifies two things: the name of the
subrepository mounted under <path>, and the active branch in that
mount at the time of commit. So .gitrepos would look like this:
[mount "<path>"]
subrepo = <subname>
branch = <branchname>
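Since this is ordinary git config syntax, reading it is simple. The sketch below uses Python's standard configparser as a stand-in for a real git config parser; it assumes that paths contain no escaped quotes or backslashes, and parse_gitrepos is a hypothetical name.

```python
import configparser
import re

# Matches section names of the form: mount "<path>"
_MOUNT_SECTION = re.compile(r'^mount "(.*)"$')


def parse_gitrepos(text):
    """Parse .gitrepos content into {path: (subname, branch_or_None)}.

    Assumption: configparser is close enough to git's config syntax for
    this simple file; quoting and escaping in paths are not handled.
    """
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(text)
    mounts = {}
    for section in parser.sections():
        match = _MOUNT_SECTION.match(section)
        if match is None:
            continue  # not a mount section; ignore it
        entry = parser[section]
        # "branch" is absent when the mount had a detached HEAD at commit time
        mounts[match.group(1)] = (entry["subrepo"], entry.get("branch"))
    return mounts
```

Two mount points can name the same subrepository, each with its own active branch.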
Rationale: by storing the active branch name we can cater for the very
common case where we check out a gitlink pointing to the current head
of the branch. Then, when we check out the subrepository at the mount
point, we can adjust HEAD to point to the correct branch.
By mapping a path to a subrepository (instead of the other way around,
as git-submodule does), we can have multiple mount points for the
same subrepository, presumably with different active branches.
Sometimes we want to have separate working trees for various branches,
and it's good to be able to store this configuration in the containing
tree.
* Working tree layout
When a tree containing mount points is checked out, a repository is
created at each of those mount points. For every <path> specified in
.gitrepos with subrepo <subname> and active branch <branchname>, and a
gitlink in <path> pointing to <commit>, we do the following:
- Create a repository under <path>/.git
- Add the object store of the containing repository to
  <path>/.git/objects/info/alternates
- Pull (just copy, really) the containing repository's references to
  the subrepository as follows:
  - refs/heads/subrepos/<subname>/* -> refs/heads/*
  - refs/tags/subrepos/<subname>/* -> refs/tags/*
  - refs/remotes/<remote>/subrepos/<subname>/* -> refs/remotes/<remote>/*
- If refs/heads/<branchname> in the subrepository now points to
  <commit>, set HEAD as a symref to it. Otherwise set a detached HEAD
  directly to <commit>.
- Check out HEAD in the subrepository.
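The ref copying and the HEAD decision in the steps above can be sketched as a pure function over a {refname: commit} mapping. This is only an illustration: subrepo_view is a hypothetical name, remote names are assumed to contain no slash, and creating the repository and writing the alternates file are left out.

```python
import re


def subrepo_view(container_refs, subname, branch, commit):
    """Compute the refs and HEAD for a freshly checked-out mount point.

    container_refs maps ref names in the containing repository to commit
    ids. Returns (refs, head), where head is ("symref", refname) or
    ("detached", commit).
    """
    name = re.escape(subname)
    # Assumption: remote names contain no slash.
    rules = [
        (re.compile(rf"^refs/heads/subrepos/{name}/(.+)$"), r"refs/heads/\1"),
        (re.compile(rf"^refs/tags/subrepos/{name}/(.+)$"), r"refs/tags/\1"),
        (re.compile(rf"^refs/remotes/([^/]+)/subrepos/{name}/(.+)$"),
         r"refs/remotes/\1/\2"),
    ]
    refs = {}
    for ref, sha in container_refs.items():
        for pattern, template in rules:
            match = pattern.match(ref)
            if match:
                refs[match.expand(template)] = sha
                break

    # Attach HEAD to the recorded branch only if it still points to the
    # commit recorded in the gitlink; otherwise detach at that commit.
    head_ref = f"refs/heads/{branch}" if branch else None
    if head_ref is not None and refs.get(head_ref) == commit:
        head = ("symref", head_ref)
    else:
        head = ("detached", commit)
    return refs, head
```

Refs belonging to other subrepositories, and the containing repository's own heads, are simply left out of the view.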
Rationale: it was a tempting idea to make refs/heads and refs/tags
symlinks directly to the correct subdirectories in the containing
repository, and likewise make objects/ directly a symlink to the
containing repository's object store. However, this is not really
feasible due to packed-refs, and it would make symlinks a requirement,
something that git tries to avoid. (Of course "directory symrefs"
would be a simple addition to the core.)
More importantly, a symlink to the object store would break git-gc.
Also, it would be ugly to have ref manipulations under the mount point
directly affect the refs in the containing repository. It's better
that none of the changes under the mount point affect the containing
repository in any way before an explicit add and check-in. At check-in
time the refs are pulled back in the reverse direction.
* Basic commands
** git subrepo add <path> [<subname>]
Add a subrepository to the containing repository, or add the changes
in a subrepository to the index.
If <path> is not yet found in .gitrepos, <subname> must be specified.
Otherwise <subname> is looked up from .gitrepos.
The command performs the following:
- Add or update the gitlink to the index: git add <path>
- Add or change an entry in .gitrepos, setting mount.<path>.subrepo
to <subname> and mount.<path>.branch to the active branch under <path>
(if any).
- git add .gitrepos
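The bookkeeping half of this command, updating the mount entry and writing the file back in the .gitrepos format shown earlier, could look roughly like the sketch below. The names set_mount and format_gitrepos are hypothetical, and letting an explicit <subname> override an existing entry is my assumption, since the description leaves that case open.

```python
def set_mount(mounts, path, subname=None, branch=None):
    """Return a copy of mounts with the entry for path added or updated.

    mounts maps path -> (subname, branch). If path is new, subname is
    required; otherwise it is looked up from the existing entry.
    Assumption: an explicitly given subname replaces the recorded one.
    """
    if subname is None:
        if path not in mounts:
            raise ValueError("<subname> must be specified for a new mount")
        subname = mounts[path][0]
    updated = dict(mounts)
    # branch is None when the mount point has a detached HEAD
    updated[path] = (subname, branch)
    return updated


def format_gitrepos(mounts):
    """Serialize mounts back into .gitrepos text, sorted by path."""
    lines = []
    for path in sorted(mounts):
        subname, branch = mounts[path]
        lines.append(f'[mount "{path}"]')
        lines.append(f"\tsubrepo = {subname}")
        if branch is not None:
            lines.append(f"\tbranch = {branch}")
    return "\n".join(lines) + "\n" if lines else ""
```

The gitlink itself is still staged with a plain git add <path>; only the metadata is handled here.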
** git subrepo checkin [-f] [<path>...]
Update the subrepo references in the containing repository to the
references in the mount points. This is meant to be run as a
pre-commit hook with no arguments.
If no paths are given, <path>... defaults to every mount path in
.gitrepos that has been changed in the index. For each <path> mounting
<subname>, perform the following:
- git fetch [-f] <path> refs/heads/*:refs/heads/subrepos/<subname>/*
- git fetch [-f] <path> refs/tags/*:refs/tags/subrepos/<subname>/*
If [-f] is given, it is passed to git fetch.
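For illustration, the two fetches for one mount point can be generated as argument lists, with the wildcard written out on both sides of each refspec as git fetch requires. The function name checkin_commands is hypothetical, and nothing is executed here; the lists could be handed to subprocess.run in the containing repository.

```python
def checkin_commands(path, subname, force=False):
    """Return the git fetch invocations that check in one mount point.

    Each command is an argv list meant to be run from the containing
    repository's working tree. force corresponds to the -f option.
    """
    commands = []
    for kind in ("heads", "tags"):
        argv = ["git", "fetch"]
        if force:
            argv.append("-f")
        # Copy every branch or tag under the mount point into the
        # subrepo's namespace in the containing repository.
        argv += [path, f"refs/{kind}/*:refs/{kind}/subrepos/{subname}/*"]
        commands.append(argv)
    return commands
```

Without -f, git fetch refuses non-fast-forward updates of the heads, which is how the failure case described next would show up.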
The operation can fail in the unlikely case that there are multiple
mount points for the same subrepository, and a branch has diverged
between those mount points.
Note: after this operation, any new objects that were added under the
mount point are now duplicated in the containing repository. A git gc
in the containing repository followed by a git gc in the mount point
should remove the now-redundant objects from the mount point.
Note: the default paths overlook the spurious case where we have
modified the head of a non-active branch under the mount point, but the
active branch (and hence the commit in the gitlink) has remained unchanged.
I don't know if there's a reasonable way to make "git subrepo add"
somehow stage even these kinds of changes.
** git subrepo checkout [<path>...]
Check out the subrepositories at mount points <path>..., or at all the
mount points if none are specified. This is meant to be run as a
post-checkout hook with no arguments.
This is described above in "Working tree layout". If this is not an
initial checkout, then the first two steps are skipped and just the
refs and working tree are updated.
** git subrepo mv <path> <path>
Move a mount point: git mv the actual directory and adjust the path in
.gitrepos and possibly the relative path in
<path>/.git/objects/info/alternates. (An absolute path would fix the
latter, but then we couldn't move the entire containing repository.
This is the lesser evil, IMHO.)
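To make the relative path concrete: git resolves relative entries in objects/info/alternates against the objects directory that contains the file, so the entry depends on how deep the mount point sits. A small sketch, assuming a non-bare containing repository whose object store is at .git/objects in the root of its working tree (alternates_entry is a hypothetical name):

```python
import posixpath


def alternates_entry(mount_path):
    """Return the relative alternates line for a mount point.

    mount_path is relative to the root of the containing working tree.
    Assumption: the containing repository keeps its objects in
    .git/objects at that root. A fake absolute root is used so the
    result does not depend on the current directory.
    """
    container_objects = "/.git/objects"
    subrepo_objects = posixpath.join("/", mount_path, ".git", "objects")
    return posixpath.relpath(container_objects, start=subrepo_objects)
```

After a move, the same function applied to the new path gives the value to write into the alternates file.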
Gripe: why doesn't git support arbitrary metadata for tree entries?
Then we wouldn't need to worry about syncing various path attributes
that are stored in separate files, but a simple git mv could
automatically move everything associated with the path.
** git subrepo rm <path>
Remove the mount point and its entry in .gitrepos.
* A variant design
The above design is straightforward to implement, but it has a bit of
an ad-hoc feel in that we have these magic commands that transfer refs
between the containing repository and the mount points. But there are
already standard tools for transferring refs: push and pull/fetch. It
would be more "git-like" to use these directly, and make the
containing repository be simply a remote for the mount point. We need
a special remote for this purpose: git-remote-subrepo gives a "view"
of the refs of a particular subrepo within the ref tree of the
containing repository. It just makes the following translations for
push and fetch:
subrepo://<URL>/<subname> refs/heads/<branchname>
  -> <URL> heads/subrepos/<subname>/<branchname>
subrepo://<URL>/<subname> refs/tags/<tagname>
  -> <URL> tags/subrepos/<subname>/<tagname>
subrepo://<URL>/<subname>/<remote> refs/heads/<branchname>
  -> <URL> remotes/<remote>/heads/subrepos/<subname>/<branchname>
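The ref-name half of such a helper is small. The sketch below covers only the plain heads and tags translations, with the refs/ prefix written out in full as in the repository layout; the <remote> form and the splitting of the subrepo:// URL are left out. The function names are hypothetical and this is not a working remote helper.

```python
def to_container_ref(subname, ref):
    """Translate a ref seen in the mount point to the containing repo.

    Only refs/heads/* and refs/tags/* are handled in this sketch.
    """
    for kind in ("heads", "tags"):
        prefix = f"refs/{kind}/"
        if ref.startswith(prefix):
            return f"refs/{kind}/subrepos/{subname}/{ref[len(prefix):]}"
    raise ValueError(f"unsupported ref: {ref}")


def from_container_ref(subname, ref):
    """Translate a containing-repo ref back, or return None if the ref
    does not belong to this subrepo."""
    for kind in ("heads", "tags"):
        prefix = f"refs/{kind}/subrepos/{subname}/"
        if ref.startswith(prefix):
            return f"refs/{kind}/{ref[len(prefix):]}"
    return None
```

On fetch the helper would list the containing repository's refs and keep those for which from_container_ref returns a name; on push it would apply to_container_ref to each destination ref.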
Then subrepo://<containingrepo>/<subname> is set as the origin in the
mount point, so one can just do a normal git push to push the changes
to the containing repository. Likewise, for all the remotes in the
containing repository, a remote with the same name is created under
the mount point with the url
subrepo://<containingrepo>/<subname>/<remote>. Or it can be set to
directly access the actual remote:
subrepo://<url-of-remote>/<subname>. It's a matter of taste.
The problem with explicit pushing to the containing repository is that
then changes to the refs happen completely independently of changes to
the gitlinks, and ideally these should be synchronized in a single
commit. So I'm not quite sure if the additional complexity of a remote
helper is warranted.
I hope I managed to make some sense of what this is about. Questions
and comments are appreciated.
Cheers,
Lauri