On Fri, 1 Dec 2006, sf wrote:
>
> Linus Torvalds wrote:
> ...
> > Think of it this way: one common use for submodules is really to just
> > (occasionally) track somebody else's code. The submodule should be a
> > totally pristine copy from somebody else (ie it might be the "intel driver
> > for X.org" submodule, maintained within intel), and the supermodule just
> > refers to it indirectly (ie the supermodule might be the "Fedora Core X
> > group" which contains all the different drivers from different people).
>
> Could you please be a little bit more specific about how you would store the
> "pristine copy".

Note that it's not necessarily "pristine", since the submodule clearly is
a local git repository in its own right. So like _any_ git repository, you
can have (and may well end up having) your own local branches in the
submodule, with your own local modifications.

So I'm not claiming that a submodule must always match some external git
tree 100%, and that it must be read-only or anything like that. I'm just
saying that I suspect that quite often, one of the MOST IMPORTANT parts is
that the submodule is really something that somebody else technically
maintains, and that this is actually one of the _reasons_ why it is a
submodule in the first place.

For example, a lot of projects end up having some kind of "library
component" as a submodule. Take something like a video player project,
which would have something like ffmpeg as a submodule, not because you'd
maintain ffmpeg yourself, but simply because (let's say) the library
interface changes enough, or you need a specific version with some of your
own fixes that haven't been released widely yet, so you want to carry all
the libraries you need _with_ you, even though you don't really maintain
that submodule. You at most have some small extensions of your own.

Now, in this situation, it's really really _important_ that the submodule
really is totally independent of the supermodule, for several reasons.
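To make the "local git repository in its own right" point concrete, here is
a rough sketch of a submodule that is nothing more than an ordinary clone
carrying a private branch of local fixes. Everything specific in it is an
assumption made for illustration: the `upstream` and `sub` directory names,
the `local-fixes` branch, the file `lib.c`, and the commit messages are all
hypothetical, and a throwaway local repository stands in for the real
upstream project.

```shell
#!/bin/sh
# Illustrative sketch only: all paths, branch names and file names here
# are hypothetical stand-ins, not part of any real project.
set -e
work=$(mktemp -d)

# Stand-in for the upstream project that somebody else maintains.
git init -q "$work/upstream"
cd "$work/upstream"
echo "v1" > lib.c
git add lib.c
git -c user.name=upstream -c user.email=upstream@example.invalid \
    commit -q -m "upstream: initial version"

# The "submodule" is just a normal clone: a full repository of its own.
git clone -q "$work/upstream" "$work/sub"
cd "$work/sub"

# Local modifications live on a private branch inside that repository.
git checkout -q -b local-fixes
echo "local fix" >> lib.c
git -c user.name=me -c user.email=me@example.invalid \
    commit -q -a -m "lib: local fix that could be sent upstream"

# Show what we carry on top of upstream; nothing in these commits
# refers to any supermodule.
git log --oneline origin/HEAD..local-fixes
```

Because `sub` is a plain repository, the `local-fixes` branch could be
offered to the upstream maintainer as-is, without exposing anything about
whatever supermodule happens to contain it.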
For example, since you don't "really" own that project, carrying around
your own fixes is really really painful. We know it happens all the time,
and a lot of projects end up needing their own version, but the _last_
thing you want is to be in merge hell all the time. So as a supermodule
maintainer, the best possible thing for you is to be able to push back
those local changes to the original project maintainer, so that you
_don't_ have to maintain your own changes.

But you need to realize that the real maintainer of the submodule is
TOTALLY UNINTERESTED in your supermodule. He's not going to maintain it,
and in fact, if you have anything in the submodule that ends up talking
about your supermodule, that's just going to make it a lot less likely
that the upstream maintainer will ever pull your changes. He might take a
diff from you, but in a perfect world, you'd actually be able to tell him:

    "Hey, I've got a git repository with a few fixes to your ffmpeg
     git tree, please pull from

        git://myhost.com/submodule.git

     to get these fixes:

        ... explanation of fixes and commits that are relevant to
        ffmpeg, and have nothing to do with the supermodule,
        except that you need those bug-fixes because you _use_
        ffmpeg ...

     Thanks"

See? So this is why it's really important that the submodule really is a
git repository in its own right, and why committing stuff in the
supermodule NEVER affects the submodule itself directly (it might _cause_
you to also do a commit in the submodule indirectly, but the submodule
commit MUST be totally independent, and stand on its own).

Now, you don't _have_ to push things upstream, of course.
You can always just maintain your own submodule branch, and every once in
a while, inside the submodule, you do

    # fetch the development in the origin/master branch
    git fetch submodule-origin origin/master

    # rebase our own special magic sauce on top of that
    git rebase origin/master

to update your submodule, and _then_ you do a commit in the supermodule
(after testing that the update is all ok, of course) which will update the
"commit" pointer in the supermodule.

Notice? In this example, we really maintained the submodule AS a
submodule. It was independent, but tied into the supermodule, so that when
we clone the supermodule, or do things like bisection on a supermodule, we
always end up cloning the submodule too (and in the case of bisection, we
really only bisect the supermodule, but the submodule always gets
"tracked" in the sense that we would always check out the state of the
submodule that was appropriate for that particular commit in the
supermodule).

> There seems to be some agreement to store the commit id of
> the submodule instead of a plain tree id in the supermodule's tree object, and
> that all objects that are reachable from this commit are made part of the
> supermodule repository (either fetched or via alternates). Do you agree?

Well, I would actually argue that you may often want to have a supermodule
and then at least have the _option_ to decide to not fetch all the
submodules.

For an example of this kind of usage, let me tell you how we operated at
Transmeta a few years ago, which I'm not saying is the _only_ way to
operate, but it's ONE way to do it, and I'll also explain _why_ we did it,
and why we had submodules.

In the case of Transmeta, we had our own tools, our own programs, and we
"owned" all of those. We _also_ used a lot of external tools, like gcc
etc. However, different people worked on different parts, and if you
worked on the actual x86 JIT part, you probably didn't want to have all of
the gcc stuff in your tree _too_.
That just took a lot of space, and you really didn't want to compile the
whole toolchain (which took hours), since there were precompiled binaries
readily available.

Still, from a _release_ standpoint, when we released a new binary, that
binary very much depended not just on the actual JIT sources, but on the
whole toolchain. So if you wanted to be able to re-create a release, you
really needed _everything_. You couldn't just take the "current version"
of the toolchain, you needed to have the toolchain that was used AT THE
TIME OF THE RELEASE.

And this is a _classic_ example of when you'd want to use submodules.
Notice how everybody wanted _some_ of the submodules, but really only the
release people wanted them _all_. The higher up the chain you were, the
less likely you were to really want to muck around with the compiler and
the linker, for example.

And nobody really owned all modules. So what you really want is:

 - a supermodule maintainer that is not really the maintainer of _any_ of
   the submodules, but that does the main "build world" infrastructure
   (and generally would tend to also maintain the source control
   infrastructure itself)

 - submodules that had their own maintainers, and where the maintainers
   may or may not have wanted the supermodule, but even when they wanted
   the supermodule, they might not want _all_ of the submodules, simply
   because they just didn't care.

 - some of the submodules then have _upstream_ sources that were totally
   independent, and that you would want to track, but you had zero power
   AT ALL over them, and yet you might well want to push back at least
   some of the fixes you did - at least the ones that made sense even
   outside your own project - just to avoid having to maintain a _huge_
   set of internal patches.

So no, I don't think the supermodule should even _force_ people to always
get all the submodules.
It might be the default case, but at the same time, it's just being polite
to let users decide on their own whether they really want _all_ of the
build infrastructure sources.

> If I understand you correctly you cannot make any changes to the submodules
> code _in the supermodule's repository_, no bugfixes, no extensions, no
> adaptions, nothing. Do you mean that?

Yes. I think you should make all changes _within_ the submodule, because
the submodule should still be an independent git tree in its own right.

But obviously, you'd often use a private _branch_ in the submodule because
you end up having whatever private extensions. That's always true: we
always have the "master" branch that is kind of the default "private
branch" for any repository, but obviously that is often extended upon, and
you may have several private branches.

For example, after you've done a big update (from some external upstream
source) in the submodule that you are using, you might decide that you do
all the work on that new big update in a _new_ private branch within the
submodule - and get the submodule changes all squared away on its own
_before_ you then decide to commit the end result (the tip of that new
private branch) within the supermodule.

Ie, you very much should be able to do

    git clone supermodule/that/one/submodule my-own-version-of-submodule

to clone a submodule _without_ getting anything else (but still get all
the work you did within the submodule - very much including your own
private branch work).

And the importance of keeping the submodule independent is partly just
stability and sanity, but partly also scalability. For example, the
"index" in a supermodule should NOT include the indexes of all the
submodules. That's really important, because the index doesn't really
scale. Things do slow down with large indexes.

For example, git can handle tens of thousands of files easily. I suspect
it scales well to hundreds of thousands of filenames.
But with supermodules, you really can end up in the situation where you
have _tens_ of these submodules, maybe even hundreds. And if you try to
maintain one unified index for the _whole_ thing, I guarantee you that
you'll start feeling the pain. Indexing millions of files is just not
going to be pretty.

So just from a git stability and scalability point, it's important to keep
subprojects _separate_. There is obviously integration stuff, but they
should still be seen as truly independent projects. Even the supermodule
should clearly have its own life even _regardless_ of submodules, because
(as I said) quite often you may want the supermodule, but you don't want
to have _all_ of the submodules.

But it's more than that stability and scalability thing too - keeping them
separate is what allows you to do pulls and pushes on an individual
subproject basis, and have people really work at that level. For example,
if you're the compiler guy at a company, you really do want to work with
other compiler people _outside_ the company, but you sure as hell may not
be able to give them access to your supermodule. But you may want to work
on _just_ the compiler parts (or at least share some branches in public),
which means that the subproject really has to be able to work
_independently_ of the supermodule.

So "independent" here is really key, for several reasons.

And that all means, for example, that there must NEVER be any
"backpointers". A subproject really can _never_ have backpointers to the
superproject, because that fundamentally means that the above kind of
"compiler guy works on the compiler subproject in public" cannot work, if
your supermodule isn't public.

        Linus