Re: Git commit generation numbers

Jeff King <peff@xxxxxxxx> · Fri, 15 Jul 2011 15:48:07 -0400

On Fri, Jul 15, 2011 at 09:10:48AM -0700, Linus Torvalds wrote:

> I think it's much worse to have the same information in two different
> places where it can cause inconsistencies that are hard to see and may
> not be repeatable. If git ever finds the wrong merge base (because,
> say, the generation numbers are wrong), I want it to be a *repeatable*
> thing. I want to be able to repeat on the git mailing list "hey, guys,
> look at what happens when I try to merge commits ABC and XYZ". If you
> go "yeah, it works for me", then that is bad.

Having the information in two different places is my concern, too. And I
think the fundamental difference between putting it inside or outside
the commit sha1 (where outside encompasses putting it in a cache, in the
pack-index, or whatever), is that I see the commit sha1 as somehow more
"definitive". That is, it is the sole data we pass from repo to repo
during pushes and pulls, and it is the thing that is consistency-checked
by hashes.

So if there is an inconsistency between what the parent pointers
represent, and what the generation number in "outside" storage says,
then the outside storage is wrong, and the parent pointers are the right
answer. It becomes a lot more fuzzy to me if there is an inconsistency
between what the parent pointers represent, and what the generation
number says.

How should that situation be handled? Should fsck check for it and
complain? Should we just ignore it, even though it may cause our
traversal algorithms to be inaccurate? Like clock skew, there's not much
that can be done if the commits are published.

Those are serious questions that I think should be considered if we are
going to put a generation header into the commit object, and I haven't
seen answers for them yet.

> Partly for that reason, I do think that if the generation count was
> embedded in the pack-file, that would not be an "ugly" decision. The
> pack-files have definitely become "core git data structures", and are
> more than just a local filesystem representation of the objects:
> they're obviously also the data transport method, even if the rules
> there are slightly different (no index, thank god, and incomplete
> "thin" packs).
> 
> That said, I don't think a generation count necessarily "fits" in the
> pack-file. They are designed to be incremental, so it's not very
> natural there. But I do think it would be conceptually prettier to
> have the "depth of commit" be part of the "filesystem" data than to
> have it as a separate ad-hoc cache.

Sure, I would be fine with that. When you say "packfile", do you mean
the the general concept, as in it could go in the pack index as opposed
to the packfile itself? Or specifically in the packfile? The latter
seems a lot more problematic to me in terms of implementation.

> > Those things rely on the idea that the git DAG is a data model that we
> > present to the user, but that we're allowed to do things behind the
> > scenes to make things faster.
> 
> .. and that is relevant to this discussion exactly *how*?

Because keeping the generation information outside of the DAG keeps the
model we present to the user simple (and not just the user; the
information that we present to other programs), but lets git still use
the information without calculating it from scratch each time. Just like
we present the data as a DAG of loose objects via things like "git
cat-file", even though the underlying storage inside a packfile may be
very different. I just don't see those two ideas as fundamentally
different.

> It's not. It's totally irrelevant. I certainly would never walk away
> from the DAG model. It's a fundamental git decision, and it's the
> correct one.

Of course not. I never suggested we should.

> And that is what this discussion fundamentally boils down to for me.
> 
> If we should have fixed it in the original specification, we damn well
> should fix it today. It's been "ignorable" because it's just not been
> important enough. But if git now adds a fundamental cache for them,
> then that information is clearly no longer "not important enough".

OK, so let's say we add generation headers to each commit. What happens
next? Are we going to convert algorithms that use timestamps to use
commit generations? How are we going to handle performance issues when
dealing with older parts of history that don't have generations?

Again, those are serious questions that need answered. I respect that
you think the lack of a generation header is a design decision that
should be corrected. As I said before, I'm not 100% sure I agree, but
nor do I completely disagree (and I think it largely boils down to a
philosophical distinction, which I think you will agree should take a
backseat to real, practical concerns). But it's not 2005, and we have a
ton of history without generation numbers. So adding them now is only
one piece of the puzzle.

What's your solution for the rest of it?

-Peff
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html