Re: Git commit generation numbers

Linus Torvalds <torvalds@xxxxxxxxxxxxxxxxxxxx> · Fri, 15 Jul 2011 14:17:26 -0700

On Fri, Jul 15, 2011 at 12:48 PM, Jeff King <peff@xxxxxxxx> wrote:
>
> Having the information in two different places is my concern, too. And I
> think the fundamental difference between putting it inside or outside
> the commit sha1 (where outside encompasses putting it in a cache, in the
> pack-index, or whatever), is that I see the commit sha1 as somehow more
> "definitive". That is, it is the sole data we pass from repo to repo
> during pushes and pulls, and it is the thing that is consistency-checked
> by hashes.

Sure. That is also the data that is the same for everybody.

That's a big deal, in the sense that it's the only thing we should
rely on if we want consistent behavior. Immediately if core
functionality starts using any other data, behavior becomes "local".
And I think that's really *really* dangerous.

Sure, we have "local behavior" in a lot of small details. We very much
intentionally have it in the ref-logs, and since branches and tags are
local we also have it in things like "--decorate", which obviously
depends on exactly which local refs you have.

We also have local behavior in things like .git/config etc files, so
git can behave very differently for different people even with what is
otherwise an identical repository.

So local behavior is good and expected for some things. We *want* it
for things like colorization decisions, we want it for aliases, and we
want it for branch naming.

But really core behavior shouldn't depend on local information. I
think it would be wrong if something like a merge base decision would
be based on any local information.

For example, should it matter whether something is packed or not? I
really don't think so. That's a pretty random implementation detail,
and if we get different end results because some commit happens to be
packed, vs not packed (because, say, we'd be hiding generation
information in the pack) that would be wrong.

Now, we do have things like merge resolution caches etc (which
obviously do save and use local information again), but I think that's
pretty well clarified.

> So if there is an inconsistency between what the parent pointers
> represent, and what the generation number in "outside" storage says,
> then the outside storage is wrong, and the parent pointers are the right
> answer. It becomes a lot more fuzzy to me if there is an inconsistency
> between what the parent pointers represent, and what the generation
> number says.

So I really don't see why you harp on that. If the generation counters
are in the objects THEY BY DEFINITION CANNOT BE INCONSISTENT.

That's a big issue.

Sure, they may be LYING, but that's a different thing entirely. They
will be lying to everybody consistently. There would never be any
question about what the generation number of a commit is.

See what I'm trying to say? There's no way that they would cause
different behavior for different people. Everything is 100%
consistent.

The exact same thing is true of commit dates, btw. They may be
confused as hell, and they may cause us to do bad things when we
traverse the history, but different clocks on different machines will
still not cause git to act differently on different machines. There's
no possibility of inconsistency.

(Of course, different *versions* of git may traverse the history
differently, since we've changed the heuristics over time. So we do
have that kind of inconsistent behavior, where we give different
results from different versions of git).

And btw, having "incorrect" data in the git objects is not the end of
the world. You can generate merge commits that simply have the wrong
parents. That will be confusing as hell to the user, and it will make
future merges not work very well, but it's a bug in the archive, and
that's "ok". The developers may not be very happy about it. In fact,
afaik we've had a few cases like that in the kernel tree, because
early git had bugs where it would not properly forget parents after a
failed merge. Most of them are ARM-related, because the ARM tree was
one of the first users of git (outside of me, but I had fewer issues
with what happens when things go wrong).

So I would not be *too* shocked if we'd end up with "odd" generation
counts due to some odd bug. It sounds unlikely, but my point is that
that is not at all what I'd *worry* about.

> How should that situation be handled? Should fsck check for it and
> complain? Should we just ignore it, even though it may cause our
> traversal algorithms to be inaccurate? Like clock skew, there's not much
> that can be done if the commits are published.

Right. I simply think it's not a big deal.

IOW, if we would rely on generation counts instead of clock dates,
maybe the generation counts would have occasional problems too, but I
suspect they'd be *much* rarer than time-based issues, because at
least the generation count is a well-defined number rather than a
random thing we pick out of emails and badly maintained machines.

That said, I'm not 100% sure at all that we want generation numbers at
all. Their use is pretty limited. If we had had them from the
beginning, I think we would simply have replaced the date-based commit
list sorting with a generation-number-based one, and it should have
been possible to guarantee that we never output a parent before the
commit in rev-parse.

As it is, I have to admit that looking at it, I shudder at changing
the current date-based logic and replacing it with a "date or
generation number".

The date-based one, despite all its fuzziness and not being very well
defined ("Global clock in a distributed system? You're a moron") and
up being a *nice* heuristic for certain human interaction. So it's not
a wonderful solution from a technical standpoint, but it does have (I
think) some nice UI advantages.

(For an example of that: using "--topo-sort" for revision history may
be a very good thing technically, but even if it wasn't for the fact
that it's more expensive, I think that our largely time-based default
order for "git log" in many ways is a better interface for humans. Of
course, when mixed with actually giving a history graph, that changes,
because then you want the "related" commits to group together, rather
than by time. So I think it's just basically a fuzzy area, without any
clear hard rules - which is probably why using that fuzzy timestamp
works so well in practice)

> Those are serious questions that I think should be considered if we are
> going to put a generation header into the commit object, and I haven't
> seen answers for them yet.

I do agree that the really *big* question is "do we even need it at
all". I do like perhaps just tightening the commit timestamp rules.
Because I do think they would probably work very well for the
"contains" problem too.

With the exact same fuzzy downsides, of course. Timestamps aren't
perfect, and they need that annoying fuzz factor thing.

>> That said, I don't think a generation count necessarily "fits" in the
>> pack-file. They are designed to be incremental, so it's not very
>> natural there. But I do think it would be conceptually prettier to
>> have the "depth of commit" be part of the "filesystem" data than to
>> have it as a separate ad-hoc cache.
>
> Sure, I would be fine with that. When you say "packfile", do you mean
> the the general concept, as in it could go in the pack index as opposed
> to the packfile itself? Or specifically in the packfile? The latter
> seems a lot more problematic to me in terms of implementation.

I was thinking the "general" issue - it might make most sense to put
them in the index.
>> If we should have fixed it in the original specification, we damn well
>> should fix it today. It's been "ignorable" because it's just not been
>> important enough. But if git now adds a fundamental cache for them,
>> then that information is clearly no longer "not important enough".
>
> OK, so let's say we add generation headers to each commit. What happens
> next? Are we going to convert algorithms that use timestamps to use
> commit generations? How are we going to handle performance issues when
> dealing with older parts of history that don't have generations?

So I do think the _initial_ question need to be the other way around:
do we have to have generation numbers at all?

I think it's likely a design misfeature not to have them, but
considering that we don't, and have been able to make do without for
so long, I'm also perfectly willing to believe that we could speed up
"contains" dramatically with the same kind of (crazy and inexact)
tricks we use for merge bases.

(Looking at a profile, a third - and the top entry - of the "git tag
--contains" profile cost is just in "clear_commit_marks()" - not doing
any real work, rather *undoing* the work in order to re-do things. So
it's entirely possible that the real issue is simply that
"in_merge_bases()" is badly done, and we could speed things up a lot
independently of anything else).

For example, for the "git tag --contains" thing, what's the
performance effect of just skipping tags that are much older than the
commit we ask for?

                Linus
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html