(I wrote most of this a couple of days ago, so it's not at the tip of the conversational tree, so to speak. But it's effectively a response to Linus's "what do you want to do with submodules" question, with some thoughts on implementation. Sorry it's so long; like Blaise Pascal, "I would have written a shorter letter, but I did not have the time.") The supermodule concept, implemented right, could really improve cooperation among embedded platform integrators, boutique distro publishers, and other editorial contributors to sprawling metaprojects who don't want to run kernel.org-scale mirrors. To make this work, you need sparse repositories (conserving resources when fetching, by omitting the bulk of currently un-needed submodules that can reliably be obtained later from elsewhere) and shallow cloning (conserving resources when publishing, by referring cloners to a third-party repository for universally available content). For instance, it would be a wonderful thing if the pile-o-patches nightmare that is PTXdist (and crosstool and buildtool and every other approach I have seen for ongoing maintenance of embedded toolchains and userlands) were obsoleted by a git supermodule. Its submodules would mostly track external projects, but would also logically contain the fix-up patches worked out during platform integration, checked in to branches anchored at each upstream release point. The supermodule would contain all of the build automation, log auditing, and remote unit testing stuff, as well as the metadata for each submodule involved in this platform build cycle. At a content level, the sparsely populated / shallowly published supermodule wouldn't be much different from today's PTXdist. But the pay-off comes when you merge forward to a new release of some base component (compiler, library, etc.) and discover that some of your fix-ups have been adopted or obsoleted upstream, and new fix-ups are needed for components that depend on the updated bit, and the set of configurables has changed (for which you need to compensate in the meta-configurator). Instead of piling up versioned patch directories, you commit fix-ups to the sub-modules, which other integration branches can ignore (if they aren't affected), merge, or cherry-pick. As I understand it, in today's git, every content object is a patch to the _data_ of one and only one git repository, containing the label of the preceding _data_ state plus a diff of file contents and attributes. Assuming this model is retained, any clean state of a "leaf" module (one with no submodules) can be reached by replaying a series of patches, starting from the repository's root node (an empty directory with the hopefully unique label generated by init-db). The label (SHA1) of the last patch is therefore a perfectly good label for this _data_ state. If all we were trying to do with supermodules was to capture and track various states of the submodules' data, we could extend the format of content objects to include "state X of submodule with init-db label Y". That would have the effect of capturing submodule states as _data_ in non-"leaf" modules. We would have to help cloners find a place from which to pull these states, of course; and it's easy to get sidetracked onto that part of the problem. But that's not where the bang for the buck is in supermodules. The whole model of distributed supermodules, with references to slightly diverging submodules whose content should mostly be fetched from external sources, smells to me just like LVM. The external sources (like an LVM volume of which you have taken a "snapshot") make up the bulk of the content pool. They also give you a window into developments on the submodule's own branches (like being able to peek forward and merge changes from the original volume). The supermodule (the snapshot volume) provides most of the interesting refs (submodule commits referenced by supermodule tags and branch heads), along with enough "journaled" content to replay forward from some checkpoint guaranteed to be available in each external source to any of these refs. The implication here is that submodule states are not just SHA1 labels to be embedded within supermodule data diffs. One ought to be able to clone a supermodule without immediately cloning full copies of any of its submodules. This ought to populate the clone's content database with all of the quanta of submodule content that aren't guaranteed to be available from any not-too-stale submodule mirror. When cloning, you don't want to have to inspect every supermodule state for submodule states that are outside the global subset. So the supermodule needs to maintain a set of supplemental refs from which all referenced submodule states can be reached. This allows you to traverse the portion of the pool of submodule content that can't be reached from true submodule branch heads. On 12/1/06, Linus Torvalds <torvalds@xxxxxxxx> wrote:
Yes, you do need to have a list of submodules somewhere, and you'd need to maintain that separately. One of the results of having the submodules be independent from the supermodule is that it's not all "automatically integrated", and thus the supermodule does end up having to have things like that maintained separately.
This is not a defect; it's a virtue. It's important for every commit to the supermodule to contain the information of which submodule branches you're currently on and how far along them you've crawled. Any particular supermodule commit point is likely to reflect an integration milestone visible only to the person working at the supermodule level. No content object should ever cross a submodule boundary, because then you wouldn't be able to apply it to the submodule in isolation (or in another supermodule state) or identify it when it is applied upstream and propagates back to you in a pull. But the supermodule can also contain supplemental refs (heads and tags) that don't exist in the submodule (and shouldn't necessarily be pushed to it); the commits they refer to are localized to the submodule but may not be reachable from any of the submodule's branch heads.
And yes, if you screw that up, you wouldn't be able to fetch submodules properly etc, even if you see the supermodule, and yes, this sounds more like the CVS "Entries" kind of file that is more "tacked on" than really deeply integrated. But I think the separation is _more_ than worth the fact that you can see things being separate.
There is an opportunity for useful deep integration here. The same algorithm that does reachability analysis for "git prune" can dig from supermodule down to submodules, copying objects into the supermodule database until it hits a commit that is advertised as "global" by the submodule. "git clone" of the supermodule can then pull the bulk of the submodules (a superset of the "global" subset) from (a mirror of) the canonical place for each, and use the supermodule object database as an alternate source for commits that don't exist in the "canonical" submodule.
In fact, I'm very much arguing for keeping things as separate as possible, while just integrating to the smallest possible degree (just _barely_ enough that you can do things like "git clone" and it will fetch multiple repositories and put them all in the right places, and "git diff" and friends will do reasonably sane things). Keep it simple, stupid.
As simple as possible; but no simpler. The "alternates" / "git clone --reference" model is already almost powerful enough for the supermodule to contain a "journal" of submodule commits that haven't yet been retired to the canonical subset (guaranteed present in each mirror). The only difference is that the supermodule should be considered a "weak alternates" source. Commit objects in the supermodule's database should be visible to submodule-level operations (so that commits which are accepted upstream get flowed in nicely during "git pull"). But if a commit becomes reachable from a ref that is really in the submodule (not just one of the supermodule's "supplemental refs", which should _not_ be visible to submodule operations), then it should be copied into the submodule's object database. (The refs internal to the submodule should retain their integrity even if the supermodule is inaccessible.) The existing "strong alternates" mechanism should be reserved for repos which are at least as public and persistent as the submodule, and supermodules don't qualify (e. g., Linus's transmeta scenario).
On Sat, 2 Dec 2006, Josef Weidendorfer wrote: > The thing I wanted to discuss is whether such names would need to be globally > unique in the project containing submodles, or not. My preference would be for it to be "local", just because (as I mentioned), with mirroring etc, it might well be that you want to fetch things from the _closest_ repository. That's really not a global decision, it's a local one.
I think "global resource, local provider" is the way to go, with each provider advertising what checkpoints of what resources it can supply. When I clone or pull, I should be able to consult a local mapping of submodule URIs to "mirrors" (which may well be local repositories containing content and branches that aren't in the "official" upstream). The only thing that may need "global" agreement is the boundaries of the "global" subset for each submodule, i. e., the set of commit objects that can reliably be obtained from any mirror of the "official" upstream repository. That doesn't need to be terribly clever; "at least three days old on a globally published branch" would probably be a perfectly good heuristic.
> If yes, it IMHO makes a lot of sense to introduce "submodule objects" which contain > these submodule names, and which are used as pointers to submodule commits in > supermodule trees. You could do it that way, and then it would be global. It would work, and in many ways it would probably be "simpler" on a supermodule level.
I think the implication of "submodule objects" is that supermodule diffs would say "roll submodule X from commit-id A to commit-id B". I don't think that would work very well for pulls/merges in the sparsely populated scenario, because you want to be able to pull the non-canonical subset of the individual diffs between states A and B into the supermodule's object pool. When you decide later to flesh out submodule X, you should only have to clone some canonical mirror and then fast-forward to state B using objects you already have in the supermodule pool. The merge case is even clearer. Suppose I pull updates from two remote branches of the supermodule onto my master branch. Each remote branch has added the same submodule, cloned from third-party repositories whose clone history goes back to the same origin. (The example I have in mind is when some project switches to git from some other SCM, and the maintainers of the remote branches port their integration patches over from their git-svn tracker submodule to a clone of upstream's new git repo.) I should be able to postpone the merge effort, come back later and clone the upstream repo, then merge the non-canonical commits that were pulled earlier. I might want to decide at supermodule pull time to postpone pulling the bodies of the submodule commits; but I want the full sequence of submodule commit IDs in the supermodule commit object. So it's not so much the supermodule _state_ that has a hierarchical structure; it's the supermodule _diffs_ and _object_pool_ that become hierarchical.
The advantage of a global namespace is that you can much more easily update it - "git fetch" will just fetch the new file(s) that describe the subprojects very naturally if they are all global. Putting them in a local .git/config file has it's advantages (see above), but it also makes it very hard to version them, and to update the list - it would have to become manual.
I think the only global-to-local-namespace mapping applies to the different labels for the "empty repository" state generated at init-db time. Given the init-db SHA1 of the linux kernel repository, I should be able to choose any mirror or clone of that repository as a source for objects in its "global set". I expect this provider not to scribble on globally published branches, but that isn't even all that critical; anything outside the canonical set is kept in the supermodule's object pool, so I can always blow the submodule away and regenerate it from a different mirror.
There are possibly combinations of the two approaches: have a "global namespace" that describes the canonical place to get the subprojects, but have some way to add local "translation" of the canonical names into locally preferred versions (eg you could just have a way to say "this is the local mirror for that global canonical place") Maybe that would work?
Sure. But all you really need from the canonical place is its init-db SHA1 (permanent) and its list of globally published branches (monotonically expanding). A URL for it is a convenient shorthand but doesn't have to be persistent. Cheers, - Michael - To unsubscribe from this list: send the line "unsubscribe git" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html