On Thu, 12 Apr 2007, Sam Vilain wrote:
> >
> > And yes, that's absolutely true, but it's technically no different
> > from just using GIT_OBJECT_DIRECTORY to share objects between
> > totally unrelated projects, or using git/alternates to share objects
> > between (probably *less* unrelated repositories, but still clearly
> > individual repos).
>
> Would that be the only distinction?
>
> Would submodules be descended into for object reachability questions?

I think we'll eventually want that *regardless* of how the object
handling is done (a kind of "cross-submodule boundary check"), but I
think that's actually outside of the scope of the current fsck.

The current fsck goes to great lengths to make sure that the internal
consistency of a repository is good. That's also why it takes so long,
and why it is such an expensive operation to do (notably when you do a
"--full" check).

In contrast, the "cross-submodule boundary check" is a much cheaper
operation, *if* you have already verified that the projects are
internally consistent. It literally boils down to doing a very
simplified commit chain walker that only parses tree objects and simply
spits out the SHA1's of the sub-tree commits (and their location in the
tree), and then a separate phase that just verifies those against the
submodules.

And that separate phase - once you've done the fsck for all the
*individual* repositories - is truly trivial. It's literally just a
matter of "is that SHA1 a valid commit object". That's *cheap*.

See?

> I'm particularly interested in repositories with, say, thousands of
> submodules but only a few hundred meg. I really want to avoid the
> situation where each of those submodules gets checked or descended
> into separately for updates etc.

So I think that the way to verify a superproject is:

 - fsck each and every project totally independently. This is something
   you have to do *anyway*.

 - either as you fsck, or as a separate phase after the fsck, just
   traverse the trees and spit out "these are the SHA1's of subprojects"

 - finally, just go through the list of SHA1's (after every project has
   been fsck'd) and verify that they exist (since if they exist, they
   will have everything that is reachable from them, as that's one of
   the things that the *local* fsck verifies)

Notice? At no point do you actually need to do a "global fsck". You can
do totally independent local fsck's, and then a really cheap test of
connectedness once those fsck's have completed.

The reason a *full* global fsck is so expensive is that it would have an
absolutely humungous working set, and effectively keep everything in
memory through it all. Doing it in stages ("fsck smaller individual
trees separately") is actually the same amount of absolute work, but the
working set never grows, so it scales much better.

(fsck'ing projects individually also happens to allow you to do the
sub-project fsck's in parallel across multiple CPU's or multiple
machines, so it actually scales much better that way too - but the big
problem tends to be excessive memory use, so the "SMP parallel version"
only makes sense if you have tons of memory and can afford to do these
things at the same time!)

		Linus
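PS. In shell, those three phases might look something like this -
totally untested, and it assumes the superproject is checked out in the
current directory, that each submodule is checked out at the path its
gitlink entry records, and that gitlinks show up as the mode-160000
entries git uses for them in tree objects:

    # Phase 1: fsck every repository totally independently.  The runs
    # don't depend on each other, so this is the part you could farm
    # out across multiple CPU's or machines.
    git fsck --full || echo "fsck failed: superproject"
    git ls-files --stage | awk '$1 == 160000 { print $4 }' |
    while read path; do
            (cd "$path" && git fsck --full) || echo "fsck failed: $path"
    done

    # Phase 2: walk the superproject history, parse only the trees,
    # and spit out the SHA1's of the subproject commits and where they
    # live.  (Naive: a smarter walker would skip trees it has already
    # seen instead of re-reading them for every commit.)
    git rev-list --all |
    while read commit; do
            git ls-tree -r "$commit"
    done |
    awk '$1 == 160000 { print $3, $4 }' | sort -u >gitlinks

    # Phase 3: the trivial part - is each recorded SHA1 a valid
    # commit object in the submodule at that path?
    while read sha1 path; do
            GIT_DIR="$path/.git" git cat-file commit "$sha1" >/dev/null 2>&1 ||
                    echo "broken link: $path needs commit $sha1"
    done <gitlinks

Phase 3 asks nothing more than "does that commit object exist", which is
why it stays cheap no matter how many submodules you have; the one thing
the sketch glosses over is finding the repository for a path that shows
up in old history but is no longer checked out.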