Re: Git commit generation numbers

Linus Torvalds <torvalds@xxxxxxxxxxxxxxxxxxxx> · Thu, 14 Jul 2011 18:19:30 -0700

On Thu, Jul 14, 2011 at 1:31 PM, Jeff King <peff@xxxxxxxx> wrote:
>
> However, I'm not 100% convinced leaving generation numbers out was a
> mistake. The git philosophy seems always to have been to keep the
> minimal required information in the DAG.

Yes.

And until I saw the patches trying to add generation numbers, I didn't
really try to push adding generation numbers to commits (although it
actually came up as early as July 2005, so the "let's use generation
numbers in commits" thing is *really* old).

In other words, I do agree that we should strive for minimal required
information.

But dammit, if you start using generation numbers, then they *are*
required information. The fact that you then hide them in some
unarchitected random file doesn't change anything! It just makes it
ugly and random, for chrissake!

I really don't understand your logic that says that the cache is
somehow cleaner. It's a random hack! It's saying "we don't have it in
the main data structure, so let's add it to some other one instead,
and now we have a consistency and cache generation problem instead".

Just look at the size of the patches in question. Your caching patches
are bigger and more complicated. Sure, part of it is that your series
adds the code to _use_ the generation number, but look purely at the
code to maintain them.

Why do you think the odd separate cache is somehow better than just
doing it right? Seriously? If we require the generation numbers, then
they have *become* that minimal information that we should save!

> And I think that has served us
> well, because we're not saddled with cruft that seemed like a good idea
> early on, but isn't.

Again - we discussed adding generation numbers about 6 years ago. We
clearly *should* have done it. Instead, we went with the hacky "let's
use commit time", that everybody really knew was technically wrong,
and was a hack, but avoided the need.
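
To make the "technically wrong" part concrete, here is a minimal sketch in
Python (not git code; the Commit class and its fields are made up for the
example). It assumes the usual definition of a generation number: 1 for a
root commit, otherwise one more than the largest parent generation. A commit
made on a machine with a lagging clock breaks timestamp ordering, while the
generation number still increases along every parent link.

```python
from dataclasses import dataclass, field


@dataclass
class Commit:
    # Hypothetical stand-in for a commit object; not git's real layout.
    name: str
    timestamp: int
    parents: list = field(default_factory=list)
    generation: int = 0

    def __post_init__(self):
        # Assumed definition: roots get 1, everything else gets
        # 1 + max(parent generations). Only the direct parents are read.
        self.generation = 1 + max((p.generation for p in self.parents), default=0)


root = Commit("root", timestamp=1000)
a = Commit("a", timestamp=2000, parents=[root])
# Committed on a machine whose clock is behind: the child looks older than its parent.
b = Commit("b", timestamp=1500, parents=[a])

commits = (root, a, b)
time_order_ok = all(c.timestamp > p.timestamp for c in commits for p in c.parents)
gen_order_ok = all(c.generation > p.generation for c in commits for p in c.parents)
print(time_order_ok, gen_order_ok)
```

Note that computing the number when the commit is created only needs the
parents' stored values, not a walk over history.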

Now, six years later, you clearly are saying that we need the
generation numbers, but then you go off and try to say that they
should be in some secondary non-architected random collection of data
structures that isn't covered by the security and maintenance
guarantees that the core git objects are.

Dammit, one of the things that makes git special is that the data
structures are NOT random odd ad-hoc files. There is a design to them.

> Generation numbers are _completely_ redundant with the actual structure
> of history represented by the parent pointers.

Not true. That's only true if you add ".. if you parse the whole
history" to that statement.

And we've *never* parsed the whole history, because it's just too
expensive and doesn't scale. So right now we depend on commit dates
with a few hacks.

So no, generation numbers are not at all redundant. They are
fundamental. It's why we had this discussion six years ago.
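
As a rough sketch of why a stored generation number matters (Python, with
made-up names and a toy history; this is not how git implements its
traversal): an "is X an ancestor of Y" walk can stop at any commit whose
generation is below X's, because nothing that far down can have X as an
ancestor. Without the stored number, a negative answer means walking all the
way to the root.

```python
def is_ancestor(candidate, tip, parents, generation):
    """Return (reachable, commits_visited) for a walk from `tip` toward `candidate`.

    `parents` maps commit name -> list of parent names, and `generation`
    maps commit name -> stored generation number (both hypothetical).
    """
    floor = generation[candidate]
    stack, seen, visited = [tip], {tip}, 0
    while stack:
        commit = stack.pop()
        visited += 1
        if commit == candidate:
            return True, visited
        for parent in parents[commit]:
            # A commit below `floor` cannot have `candidate` as an ancestor,
            # so the walk never needs to go there.
            if generation[parent] >= floor and parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return False, visited


# Toy history: a 1000-commit mainline plus one side commit forked off m995.
parents = {"m0": []}
for i in range(1, 1000):
    parents[f"m{i}"] = [f"m{i - 1}"]
parents["side"] = ["m995"]

# Parents are inserted before children above, so one pass is enough here.
generation = {}
for name, ps in parents.items():
    generation[name] = 1 + max((generation[p] for p in ps), default=0)

not_merged = is_ancestor("side", "m999", parents, generation)
merged = is_ancestor("m995", "m999", parents, generation)
print(not_merged, merged)
```

In this toy case the negative answer for "side" comes after visiting a
handful of commits instead of the full 1000-commit mainline.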

> And so that seems a bit hack-ish to me.

Um? If you feel that way, then why the hell are you pushing your EVEN
MORE HACKISH CACHE PATCHES?

That's what this really boils down to. I think that if we have a value
that we need, then it should be recorded. In the data structures. Not
in some random other location that isn't part of the real git data
structures.

We don't do caches in git, because we don't NEED to. Sure, gitk has
its hacky cache, but that's not core functionality.

I think it's a sign of good design that we can do a "find .git" and
explain every single file, and show that it's all core functionality
(again, with the exception of "gitk.cache", and I suspect that's
because gitk is a script, not because of any really fundamental data
issues).

I think the *cache* is a hell of a lot more hacky than just doing it right.

> I liken it somewhat to the "don't store renames" debate.

That's total and utter bullshit.

Storing renames is *wrong*. I've explained a million times why it's
wrong. Doing it is a disaster. I know. I've used systems that did it.
It's crap. It's fundamentally information that is actively misleading
and WRONG. It's not even that you can do rename detection at run-time,
it's that you *HAVE* to do rename detection at run-time, because doing
it at commit time is simply utterly and fundamentally *wrong*.

Just look at "git blame -C" to remind yourself why rename information is wrong.
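
For illustration only, run-time rename detection can be sketched as
comparing the content of paths that disappeared against paths that appeared
between two snapshots. This is a simplified Python sketch, not git's actual
algorithm: the detect_renames function, the dict-of-strings "tree"
representation, the use of difflib, and the 0.5 threshold are all
assumptions made for the example.

```python
from difflib import SequenceMatcher


def detect_renames(old_tree, new_tree, threshold=0.5):
    """Pair deleted paths with added paths whose contents look similar.

    Trees are plain dicts of path -> file content (a hypothetical
    representation). Nothing about renames is stored; the pairing is
    derived from the two snapshots alone.
    """
    deleted = {p: c for p, c in old_tree.items() if p not in new_tree}
    added = {p: c for p, c in new_tree.items() if p not in old_tree}
    renames = []
    for new_path, new_content in added.items():
        best_path, best_score = None, threshold
        for old_path, old_content in deleted.items():
            score = SequenceMatcher(None, old_content, new_content).ratio()
            if score >= best_score:
                best_path, best_score = old_path, score
        if best_path is not None:
            renames.append((best_path, new_path))
            del deleted[best_path]  # each old path is matched at most once
    return renames


body = "int add(int a, int b)\n{\n\treturn a + b;\n}\n"
old = {"util.c": body, "main.c": "int main(void) { return 0; }\n"}
new = {"math.c": body + "/* moved */\n", "main.c": "int main(void) { return 0; }\n"}

found = detect_renames(old, new)
unrelated = detect_renames({"a.txt": "aaaaaaaaaa"}, {"b.txt": "bbbbbbbbbb"})
print(found, unrelated)
```

Because the pairing is computed from content when somebody asks, the same
snapshots can also answer questions a recorded "file A became file B" entry
cannot, such as code that moved between files.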

But even more importantly, look at git merges. Look at how git has
gotten merging right since pretty much day #1, and has absolutely no
issues with files that got generated two different ways. Look at every
SCM that tries to do rename detection, and look at how THEY CANNOT DO
MERGES RIGHT.

It's that simple. Rename detection is not about avoiding "redundant
data". It's about doing the right thing.

                          Linus