Re: Git commit generation numbers

Linus Torvalds <torvalds@xxxxxxxxxxxxxxxxxxxx> · Thu, 14 Jul 2011 11:47:45 -0700

On Thu, Jul 14, 2011 at 11:37 AM, Jeff King <peff@xxxxxxxx> wrote:
>
> I'd love to have in-commit generation numbers. I'm just not sure we can
> get the speeds we want without caching them for existing commits.

So my argument would be that we'd simply be much better off fixing the
fundamental data structure (which we can), and let it become the
long-term solution.

Now, it *may* turn out that we'd want to have some cache for
generation numbers in commits that don't have them, but I absolutely
think that that should be an "add-on" rather than anything fundamental.
For example, if we just merge the "add generation numbers to the
commit object" logic first, then the "cache" case never really needs
to care about us generating new commits. They simply won't need the
cache.

Also, I suspect that the cache could easily be done as a *small* and
*incomplete* cache, ie you don't need to cache all commits, it would
be sufficient to cache a few hundred spread-out commits, and just know
that "from any commit, the cached commit will be quickly reachable".
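
A minimal sketch of the small, incomplete cache idea, under some loudly
stated assumptions: history is modelled as a hypothetical dict mapping a
commit id to its parent ids, a root commit has generation 1, and every
other commit has generation 1 + max(parent generations). The function
name and data layout are illustrative only and are not taken from git.

```python
from typing import Dict, List, Tuple


def generation_with_sparse_cache(
    commit: str,
    parents: Dict[str, List[str]],
    cache: Dict[str, int],
) -> Tuple[int, int]:
    """Return (generation, number of commits whose generation was computed).

    Assumption: roots have generation 1, everything else is
    1 + max(generation of parents). `cache` may cover only a few commits.
    """
    known = dict(cache)  # local memo, seeded from the sparse cache
    walked = 0
    stack = [commit]
    while stack:
        cur = stack[-1]
        if cur in known:
            # Cached (or already computed): the walk stops here.
            stack.pop()
            continue
        missing = [p for p in parents[cur] if p not in known]
        if missing:
            # Resolve the parents first, then come back to `cur`.
            stack.extend(missing)
            continue
        known[cur] = 1 + max((known[p] for p in parents[cur]), default=0)
        walked += 1
        stack.pop()
    return known[commit], walked
```

On a linear chain of 1000 commits with every 100th one cached, a lookup
never has to compute more than 99 generations before it reaches a cached
commit, where the uncached walk can go all the way to the root.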

> I'm not sure that is the best plan. Calculating generation numbers
> involves going to all roots. So once you have to find any generation
> number, it's going to be expensive, no matter how many recent commits
> have generation numbers already in them (but it won't get _more_
> expensive as more commits are added; you'll always be traversing from
> the commit in question down to the roots).

It only ends up being expensive if the commit has parents that don't
have generation numbers.
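
To illustrate that point, here is a rough sketch assuming a hypothetical
in-memory `Commit` record with an optional `generation` field (None
stands for an old commit created before the field existed), and assuming
roots count as generation 1. It is not the actual patch, just a model of
the cost: the walk only descends through commits that lack a number.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass
class Commit:
    parents: List[str]
    generation: Optional[int] = None  # None: old commit without the field


def compute_generation(cid: str, commits: Dict[str, Commit]) -> Tuple[int, int]:
    """Return (generation, number of number-less commits that were computed)."""
    memo: Dict[str, int] = {}
    visited = 0
    stack = [cid]
    while stack:
        cur = stack[-1]
        if cur in memo:
            stack.pop()
            continue
        commit = commits[cur]
        if commit.generation is not None:
            # Embedded number: no need to look at any ancestor.
            memo[cur] = commit.generation
            stack.pop()
            continue
        missing = [p for p in commit.parents if p not in memo]
        if missing:
            stack.extend(missing)
            continue
        memo[cur] = 1 + max((memo[p] for p in commit.parents), default=0)
        visited += 1
        stack.pop()
    return memo[cid], visited


def add_commit(cid: str, parent_ids: List[str], commits: Dict[str, Commit]) -> int:
    """Create a commit with an embedded generation; return the walk cost.

    The cost is summed per parent, so shared old history may be counted twice.
    """
    cost = 0
    best = 0
    for p in parent_ids:
        gen, n = compute_generation(p, commits)
        best = max(best, gen)
        cost += n
    commits[cid] = Commit(list(parent_ids), best + 1)
    return cost
```

In this model the first new commit on top of old history pays for the
full walk, every commit built on it costs nothing, and a later merge of
a branch based on something old pays again only for that old part.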

That's a fairly short-term problem. For the kernel, for example,
basically no development happens on a base that is older than one or
two releases. So if I (and Greg, with the stable tree) start using my
patch, within a couple of weeks, pretty much all development would
have a generation number in its history.

Sure, sometimes I'd merge from people who based their tree on
something old, and I'd end up calculating it all. But it would get
progressively rarer.

> As we add new commits with generation numbers, we won't need to do a
> calculation to get their numbers. But if you are doing something like
> "tag --contains", you are going to want to know the generation number of
> old tags (otherwise, you can't know whether your cutoff might hit them
> or not). IOW, even if we add generation numbers _today_, every "tag
> --contains" in linux-2.6 is going to end up traversing from v3.0-rc7
> down to the roots to get its generation number (v3.0-rc8 would get an
> embedded generation, of course).

So that could easily be handled by caching. In fact, I suspect that
you could make the cache not be associated with a commit ID at all, but
with the tags and heads instead. But again, then the cache would be
a "secondary" issue, not something fundamental.
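
One way such a ref-keyed cache might look, purely as a sketch: the class
name, the `compute` callback and the ref strings are all hypothetical,
and nothing here reflects an existing git structure. Each entry records
which commit the tag or head pointed at, so a moved ref is recomputed.

```python
from typing import Callable, Dict, Tuple


class RefGenerationCache:
    """Cache generation numbers per tag/head name instead of per commit."""

    def __init__(self, compute: Callable[[str], int]) -> None:
        self._compute = compute  # expensive fallback: walk the history
        self._entries: Dict[str, Tuple[str, int]] = {}
        self.misses = 0

    def lookup(self, ref: str, commit_id: str) -> int:
        entry = self._entries.get(ref)
        if entry is not None and entry[0] == commit_id:
            return entry[1]  # ref still points at the cached commit
        # Unknown ref, or the ref moved: recompute and remember.
        self.misses += 1
        generation = self._compute(commit_id)
        self._entries[ref] = (commit_id, generation)
        return generation
```

The number of entries is bounded by the number of tags and heads rather
than by the number of commits, which keeps it an add-on in the sense
described above.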

> So if you aren't going to cache generation numbers, then you might as
> well write your traversal algorithm to assume you don't know them for
> old commits.

But that's how our algorithms are *already* written.

So why not have that as the fallback? You get the advantage of
generation numbers only with modern things, but those are the ones you
actually tend to use.
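
A sketch of what that fallback could look like for a reachability check,
assuming a hypothetical `parents` dict (commit id to parent ids) and a
`gens` dict that holds generation numbers only for the commits that have
them. The cutoff is used when both numbers are known; otherwise the walk
behaves like a plain traversal.

```python
from typing import Dict, List, Tuple


def is_ancestor(
    target: str,
    start: str,
    parents: Dict[str, List[str]],
    gens: Dict[str, int],
) -> Tuple[bool, int]:
    """Is `target` reachable from `start`? Returns (answer, commits visited)."""
    cutoff = gens.get(target)  # None if the target has no generation number
    seen = set()
    stack = [start]
    visited = 0
    while stack:
        cur = stack.pop()
        if cur in seen:
            continue
        seen.add(cur)
        visited += 1
        if cur == target:
            return True, visited
        gen = gens.get(cur)
        if cutoff is not None and gen is not None and gen <= cutoff:
            # Ancestors of `cur` have strictly smaller generations than
            # `cur`, so `target` cannot be among them: prune this path.
            continue
        # No usable numbers here: fall back to the plain walk.
        stack.extend(parents[cur])
    return False, visited
```

With two branches that both carry numbers, a negative answer comes back
after visiting a single commit; with no numbers at all the same call
walks the whole shared history, which is the behaviour of the existing
algorithms.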

Merge bases are *very* seldom historical, for example.

                     Linus