On Thu, 12 Apr 2007, Sam Vilain wrote:
> >
> > And yes, that's absolutely true, but it's technically no different
> > from just using GIT_OBJECT_DIRECTORY to share objects between
> > totally unrelated projects, or using git/alternates to share objects
> > between (probably *less* unrelated repositories, but still clearly
> > individual repos).
>
> Would that be the only distinction?
>
> Would submodules be descended into for object reachability questions?

I think we'll eventually want that *regardless* of how the object
handling is done (a kind of "cross-submodule boundary check"), but I
think that's actually outside of the scope of the current fsck.

The current fsck goes to great lengths to make sure that the internal
consistency of a repository is good. That's also why it takes so long,
and why it is such an expensive operation to do (notably when you do a
"--full" check).

In contrast, the "cross-submodule boundary check" is a much cheaper
operation, *if* you have already verified that the projects are
internally consistent. It literally boils down to doing a very
simplified commit chain walker that only parses tree objects and simply
spits out the SHA1's of the sub-tree commits (and their location in the
tree), and then a separate phase that just verifies those against the
submodules.

And that separate phase - once you've done the fsck for all the
*individual* repositories - is truly trivial. It's literally just a
matter of "is that SHA1 a valid commit object". That's *cheap*.

See?

> I'm particularly interested in repositories with, say, thousands of
> submodules but only a few hundred meg. I really want to avoid the
> situation where each of those submodules gets checked or descended
> into separately for updates etc.

So I think that the way to verify a superproject is:

 - fsck each and every project totally independently. This is something
   you have to do *anyway*.

 - either as you fsck, or as a separate phase after the fsck, just
   traverse the trees and spit out "these are the SHA1's of subprojects"

 - finally, just go through the list of SHA1's (after every project has
   been fsck'd) and verify that they exist (since if they exist, they
   will have everything that is reachable from them, as that's one of
   the things that the *local* fsck verifies)

Notice? At no point do you actually need to do a "global fsck". You can
do totally independent local fsck's, and then a really cheap test of
connectedness once those fsck's have completed.

The reason a *full* global fsck is so expensive is that it would have an
absolutely humungous working set, and effectively keep everything in
memory through it all. Doing it in stages ("fsck smaller individual
trees separately") is actually the same amount of absolute work, but the
working set never grows, so it scales much better.

(fsck'ing projects individually also happens to allow you to do the
sub-project fsck's in parallel across multiple CPU's or multiple
machines, so it actually scales much better that way too - but the big
problem tends to be excessive memory use, so the "SMP parallel version"
only makes sense if you have tons of memory and can afford to do these
things at the same time!)

		Linus
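PS. In shell, those three phases might look something like this -
totally untested, and it assumes the superproject is checked out in the
current directory, that each submodule is checked out at the path its
gitlink entry records, and that gitlinks show up as the mode-160000
entries git uses for them in tree objects:

    # Phase 1: fsck every repository totally independently.  The runs
    # don't depend on each other, so this is the part you could farm
    # out across multiple CPU's or machines.
    git fsck --full || echo "fsck failed: superproject"
    git ls-files --stage | awk '$1 == 160000 { print $4 }' |
    while read path; do
            (cd "$path" && git fsck --full) || echo "fsck failed: $path"
    done

    # Phase 2: walk the superproject history, parse only the trees,
    # and spit out the SHA1's of the subproject commits and where they
    # live.  (Naive: a smarter walker would skip trees it has already
    # seen instead of re-reading them for every commit.)
    git rev-list --all |
    while read commit; do
            git ls-tree -r "$commit"
    done |
    awk '$1 == 160000 { print $3, $4 }' | sort -u >gitlinks

    # Phase 3: the trivial part - is each recorded SHA1 a valid
    # commit object in the submodule at that path?
    while read sha1 path; do
            GIT_DIR="$path/.git" git cat-file commit "$sha1" >/dev/null 2>&1 ||
                    echo "broken link: $path needs commit $sha1"
    done <gitlinks

Phase 3 asks nothing more than "does that commit object exist", which is
why it stays cheap no matter how many submodules you have; the one thing
the sketch glosses over is finding the repository for a path that shows
up in old history but is no longer checked out.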