A design for subrepositories

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



Hello.

I intend to work on a "subrepository" tool for git, but before I embark on the actual programming, I thought to first invite comments on the general design.

Some background first. I know that there are several existing approaches already for managing nested repositories, but none of them quite seems to fit my purposes. My primary goal is to use git for home directory backup and mirroring, while the home directory itself may of course contain repositories.

Git-subtree doesn't quite fit the bill. It allows merging a subtree into a larger tree and then again splitting it out for exporting, but this is tedious. More importantly, a merged tree gets branched along with the containing tree, whereas I want to have subrepositories precisely because the subtrees need to be branched independently of the container.

Submodules are a bit closer to what I want, but they have clearly been designed for a different purpose: a repository with submodules is only supposed to collate existing repositories, not act as a source for them. So they aren't really faithful to the distributed nature of git: there's no easy way to completely clone a repository and its submodules.

Moreover, submodules have some other annoyances like not supporting bare repositories and checking out the submodules in detached heads.

Now, in other circumstances I might just patch git-submodule to add the features I want, but it turns out that it is written in shell. I know that is a git tradition, but I'm going to get a bit religious here: anything longer than a screenful shouldn't be written in shell, and I'm certainly not going to add more lines to an already overlong script. Hence I'm going to write a separate tool using something a bit more... structured. Probably Python with Dulwich.

So here are some preliminary thoughts on how the tool should work.


* Repository layout

Every subrepository has a unique identifier. The heads of subrepository <subname> are simply stored as heads in a subdirectory of the main repository: e.g. refs/heads/subrepos/<subname>/<branchname>. Likewise for tags: refs/tags/subrepos/<subname>/<tagname>.

Rationale: if we had fully independent repositories under the main repository directory, like what git-submodule uses, there would be no easy way to enumerate all the existing subrepositories to copy them. Since the only thing we can directly list from a remote repository are references, it makes sense to store the subrepositories just as a bunch of them.

The reason for storing the subrepo references under refs/heads/ and refs/tags/ (instead of, say, refs/subrepos/) is simply that this way everything is directly compatible with standard git tools: one can do a normal git clone/push/pull for mirroring and backup purposes without any need for special tools. You only need tools once you operate on a working tree.


* Tree layout

A tree can mount references of subrepositories. There are two components to a mount: a gitlink under <path> to a particular commit of a subrepo, and an entry in .gitrepos. This is very similar to how git-submodule works.

The entry in .gitrepos specifies two things: the name of the subrepository mounted under <path>, and the active branch in that mount at the time of commit. So .gitrepos would look like this:

[mount "<path>"]
   subrepo = <subname>
   branch = <branchname>

Rationale: by storing the active branch name we can cater for the very common case where we check out a gitlink pointing to the current head of the branch. Then, when we check out the subrepository at the mount point, we can adjust HEAD to point to the correct branch.

By associating from a path to a subrepository (instead of the other way, as git-submodule does), we can have multiple mount points for the same subrepository, presumably with different active branches. Sometimes we want to have separate working trees for various branches, and it's good to be able to store this configuration in the containing tree.


* Working tree layout

When a tree containing mount points is checked out, a repository is created at each of those mount points. For every <path> specified in .gitrepos with subrepo <subname> and active branch <branchname>, and a gitlink in <path> pointing to <commit>, we do the following:

- Create a repository under <path>/.git

- Add the object store of the containing repository to <path>/.git/objects/info/alternates

- Pull (just copy, really) the containing repository's references to the subrepository as follows:

 - refs/heads/subrepos/<subname>/* -> refs/heads/*
 - refs/tags/subrepos/<subname>/* -> refs/tags/*
 - refs/remotes/<remote>/subrepos/<subname>/* -> refs/remotes/<remote>/*

- If now in the subrepository refs/heads/<branchname> points to <commit>, set HEAD as a symref to it. Otherwise set a detached HEAD directly to <commit>.

- Check out HEAD in the subrepository.

Rationale: it was a tempting idea to make refs/heads and refs/tags to be symlinks directly to the correct subdirectories in the containing repository, and likewise make objects/ directly a symlink to the containing repository's object store. However, this is not really feasible due to packed-refs, and it would make symlinks a requirement, something that git tries to avoid. (Of course "directory symrefs" would be a simple addition to the core.)

More importantly, a symlink to the object store would break git-gc. Also, it would be ugly to have ref manipulations under the mount point directly affect the refs in the containing repository. It's better that none of the changes under the mount point affect the containing repository in any way before an explicit add and check-in. At this point the refs are pulled back in the reverse direction.


* Basic commands


** git subrepo add <path> [<subname>]

Add a subrepository to the containing repository, or add the changes in a subrepository to the index.

If <path> is not yet found in .gitrepos, <subname> must be specified. Otherwise <subname> is looked up from .gitrepos.

The command performs the following:

- Add or update the gitlink to the index: git add <path>
- Add or change an entry in .gitmodules, setting mount.<path>.subname to <subname> and mount.<path>.branch to the active branch under <path> (if any).
- git add .gitmodules


** git subrepo checkin [-f] [<path>...]

Update the subrepo references in the containing repository to the references in the mount points. This is meant to be run as a pre-commit hook with no arguments.

If no paths are given, <path>... defaults to every mount path in .gitrepos that has been changed in the index. For each <path> mounting <subname>, perform the following:

- git fetch [-f] <path> refs/heads/*:refs/heads/subrepos/<subname>/
- git fetch [-f] <path> refs/tags/*:refs/tags/subrepos/<subname>/

If [-f] is given, it is passed to git fetch.

The operation can fail in the unlikely case that there are multiple mount points for the same subrepository, and a branch has diverged between those mount points.

Note: after this operation, any new objects that were added under the mount point are now duplicated in the containing repository. A git gc in the containing repository followed by a git gc in the mount point should remove the now-redundant objects from the mount point.

Note: the default paths overlook the spurious case where have modified the head of a non-active branch under the mount point, but the active branch (and hence the commit in the gitlink) have remained unchanged. I don't know if there's a reasonable way to make "git subrepo add" somehow stage even these kinds of changes.


** git subrepo checkout [<path>...]

Check out the subrepositories at mount points <path>..., or at all the mount points if none are specified. This is meant to be run as a post-checkout hook with no arguments.

This is described above in "Working tree layout". If this is not an initial checkout, then the first two steps are skipped and just the refs and working tree are updated.


** git subrepo mv <path> <path>

Move a mount point: git mv the actual directory and adjust the path in .gitrepos and possibly the relative path in <path>/.git/objects/info/alternates. (An absolute path would fix the latter, but then we couldn't move the entire containing repository. This is the lesser evil, IMHO.)

Gripe: why doesn't git support arbitrary metadata for tree entries? Then we wouldn't need to worry about syncing various path attributes that are stored in separate files, but a simple git mv could automatically move everything associated with the path.


** git subrepo rm <path>

Remove the mount point and its entry in .gitrepos.


* A variant design

The above design is straightforward to implement, but it has a bit of an ad-hoc feel in that we have these magic commands that transfer refs between the containing repository and the mount points. But there are already standard tools for transferring refs: push and pull/fetch. It would be more "git-like" to use these directly, and make the containing repository be simply a remote for the mount point. We need a special remote for this purpose: git-remote-subrepo gives a "view" of the refs of a particular subrepo within the ref tree of the containing repository. It just makes the following translations for push and fetch:

subrepo://<URL>/<subname> refs/heads/<branchname>
-> <URL> heads/subrepos/<subname>/<branchname>

subrepo://<URL>/<subname> refs/tags/<tagname>
-> <URL> tags/subrepos/<subname>/<tagname>

subrepo://<URL>/<subname>/<remote> refs/heads/<branchname> ->
-> <URL> remotes/<remote>/heads/subrepos/<subname>/<branchname>

subrepo://<URL>/<subname>/<remote> refs/heads/<branchname> ->
-> <URL> remotes/<remote>/heads/subrepos/<subname>/<branchname>

Then subrepo://<containingrepo>/<subname> is set as the origin in the mount point, so one can just do a normal git push to push the changes to the containing repository. Likewise, for all the remotes in the containing repository, a remote with the same name is created under the mount point with the url subrepo://<containingrepo>/<subname>/<remote>. Or it can be set to directly access the actual remote: subrepo://<url-of-remote>/<subname>. It's a matter of taste.

The problem with explicit pushing to the containing repository is that then changes to the refs happen completely independently of changes to the gitlinks, and ideally these should be synchronized in a single commit. So I'm not quite sure if the additional complexity of a remote helper is warranted.


I hope I managed to make some sense of what this is about. Questions and comments are appreciated.

Cheers,


Lauri

--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[Index of Archives]     [Linux Kernel Development]     [Gcc Help]     [IETF Annouce]     [DCCP]     [Netdev]     [Networking]     [Security]     [V4L]     [Bugtraq]     [Yosemite]     [MIPS Linux]     [ARM Linux]     [Linux Security]     [Linux RAID]     [Linux SCSI]     [Fedora Users]