Re: Git commit generation numbers

Linus Torvalds <torvalds@xxxxxxxxxxxxxxxxxxxx> · Sun, 17 Jul 2011 12:30:34 -0700

On Sun, Jul 17, 2011 at 11:27 AM, George Spelvin <linux@xxxxxxxxxxx> wrote:
>
> There are a few design mistakes in git.  The way the object type
> and size are prefixed to the data for hashing purposes, which prevents
> aligned fetching from memory-mapped data in the hashing code, isn't too
> pretty either.

Why would you ever care? That makes no sense.

> But git has generally preferred to avoid storing information that can
> be recomputed.  File renames are the big example.  Given this, why the
> heck store generation numbers?

Guys, please don't bring up file renames. I explained once already why
bringing up file renames just makes you look like a f^&% moron.

Let me explain one more time:

 - Storing file renames is STUPID. It's stupid for very fundamental
reasons that have absolutely *NOTHING* to do with "it can be computed
later".

It's fundamentally stupid because it will FOREVER SCREW UP YOUR DATA,
and because it will make merging an unmitigated disaster and make your
repository depend on how you *created* your data, rather than on what
the data is. It will totally break the situation of one person doing a
rename, while another person does something else to the metadata (eg a
create of the same filename).

Trying to track file identities will lead to very fundamentally
unsolvable issues like "which file identity do we choose when two
different files get the same name", or "which file identity will we
choose when one file splits in two".

Git doesn't track renames, because unlike pretty much every other SCM
out there, git really does have a good design, and because I damn well
understood the real problems.

So bringing it up as an example of "we don't store it because we can
compute it" is really totally idiotic. It's a sign of not
understanding the problems with renames. Stop doing it. That argument
is totally irrelevant. Really.

It's like saying "We shouldn't do generation numbers because fish
don't use bicycles". The only thing that kind of argument does is to
make me convinced that you don't understand the problem enough to be
worth even arguing with. It is not only a worthless argument, but it
makes your every other argument suspect.

Comprende? Stop it.

> They *can* be computed on demand, so arguably they *should*.

Umm, no.

That's actually a really bad argument.

There are valid things that we "should" do, but they have nothing to
do with "if something can be done, it should be done". That's just a
crazy argument.

A thing we really *should* do is perform well. And be really reliable.
And support a distributed workflow.

Those are real arguments that aren't about "just because it's there".

Now, some of those arguments can then be used to say "don't bother
storing redundant data". For example, redundant data takes disk space
and network bandwidth, and if something can be recomputed cheaply (ie
if it doesn't have a negative impact on performance), then redundant
data is just bad.

And what appears like a much better argument (right now) is that some
data isn't needed AT ALL, because you can make do with other data
entirely (ie dates).

But "just because we could recompute it" is a bad bad reason.

The thing is, the very basic design of git is all about *incomplete*
DAG traversal. The DAG traversal part is pretty obvious and simple,
but the *partial* thing really is very very important. We absolutely
need it for reasonable scalability. We've spent a *lot* of time in git
development on trying to perform really well by avoiding work. Not
just in revision traversal, but in many other areas too (like making
diff and merge much faster by being able to handle whole identical
recursive subdirectories by just checking the SHA1, for example).
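
(To illustrate the subtree-skipping idea above, here is a minimal
sketch, not git's actual code. The `Node` class, the `diff` function and
the toy hashing payload are all assumptions made for the example; real
git objects use a different header and tree encoding. The point is only
that two trees with the same id can be skipped without descending.)

```python
import hashlib

class Node:
    """Toy object: a blob (data=bytes) or a tree (children=dict of name -> Node)."""
    def __init__(self, data=None, children=None):
        self.children = children
        if children is None:
            payload = b"blob\0" + data
        else:
            payload = b"tree\0" + b"".join(
                name.encode() + b"\0" + bytes.fromhex(child.oid)
                for name, child in sorted(children.items()))
        # The id is computed once, when the object is created.
        self.oid = hashlib.sha1(payload).hexdigest()

def entries(node):
    """Children of a tree; blobs and missing entries have none."""
    return node.children if node is not None and node.children else {}

def diff(a, b, path="", changed=None, visited=None):
    """Collect changed blob paths, recording every path that was looked at."""
    changed = [] if changed is None else changed
    visited = [] if visited is None else visited
    visited.append(path or "/")
    if a is not None and b is not None and a.oid == b.oid:
        return changed, visited      # same id: skip everything below
    names = sorted(set(entries(a)) | set(entries(b)))
    if not names:
        changed.append(path)         # a blob was added, removed or modified
    for name in names:
        diff(entries(a).get(name), entries(b).get(name),
             f"{path}/{name}" if path else name, changed, visited)
    return changed, visited

# Hypothetical repository: only src/a.c differs between the two trees.
docs = Node(children={"x.txt": Node(b"x"),
                      "sub": Node(children={"y.txt": Node(b"y")})})
readme = Node(b"hello")
old = Node(children={"README": readme, "docs": docs,
                     "src": Node(children={"a.c": Node(b"v1"), "b.c": Node(b"b")})})
new = Node(children={"README": readme, "docs": docs,
                     "src": Node(children={"a.c": Node(b"v2"), "b.c": Node(b"b")})})
changed, visited = diff(old, new)
```

In this toy run the walk looks at "docs" once, sees the matching id, and
never touches anything underneath it. The sketch ignores file/directory
type changes.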

That's a *really* fundamental design issue in git. Performance was
always a primary goal. And by primary, I really mean primary. As in
"more important than just about anything else".  There were other
primary goals, but really not very many.

And there really aren't very good ways to limit DAG traversal.
Generation numbers are one of the very few fundamental ones. We hacked
around it with dates, and it works pretty well in practice (well
enough that I'm certainly ok with the hack), but it's definitely one
of the areas where git simply does something "wrong". It's simply not
an entirely reliable algorithm, and that fact makes me a bit
uncomfortable with it.
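
(A rough sketch of how a generation number can limit DAG traversal,
under assumptions of my own: the generation is taken to be one more than
the largest parent generation, with roots at 1, and `generations` and
`is_ancestor` are hypothetical helpers, not git functions. An ancestor
always has a strictly smaller generation than its descendants, so a walk
looking for commit `a` can stop at any other commit whose generation is
not above `a`'s.)

```python
def generations(parents):
    """parents: dict commit -> list of parents, listed parents-first."""
    gen = {}
    for c, ps in parents.items():
        gen[c] = 1 + max((gen[p] for p in ps), default=0)
    return gen

def is_ancestor(a, b, parents, gen=None):
    """Is a reachable from b? Returns (answer, number of commits examined)."""
    seen, stack, examined = {b}, [b], 0
    while stack:
        c = stack.pop()
        examined += 1
        if c == a:
            return True, examined
        if gen is not None and gen[c] <= gen[a]:
            continue    # nothing at or below a's generation can lead to a
        for p in parents[c]:
            if p not in seen:
                seen.add(p)
                stack.append(p)
    return False, examined

# Hypothetical history: a 50-commit main line c0..c49, plus a side
# branch s1, s2 forked from c40.
parents = {"c0": []}
for i in range(1, 50):
    parents[f"c{i}"] = [f"c{i - 1}"]
parents["s1"] = ["c40"]
parents["s2"] = ["s1"]
gen = generations(parents)

full_walk = is_ancestor("s1", "c45", parents)          # no pruning
pruned_walk = is_ancestor("s1", "c45", parents, gen)   # generation cutoff
```

Both walks give the same answer (s1 is not an ancestor of c45), but the
unpruned one has to go all the way to the root to be sure of it.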

(Now, in theory, a global *approximate* time is possible
in a distributed environment, and as such it's arguable that "global
time with a slop that is based on the speed of light and knowledge of
location" is at least theoretically sound. So the real problem with
commit dates is that people simply don't have good clocks. So it's a
practical problem rather than a theoretical one, and it's a practical
problem that doesn't really cause enough problems in practice to not
be workable. But I'm making excuses for it, and I _know_ I'm making
excuses for it, so I'm not really happy about it)
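
(A small sketch of the bad-clock problem, under loud assumptions: this
is a deliberately naive date cutoff with no slop, not the heuristic git
actually uses, and `is_ancestor`, the three commits and their dates are
all made up for the example. It only shows why a cutoff based on dates
can give a wrong answer where one based on generation numbers cannot.)

```python
def is_ancestor(a, b, parents, key):
    """Walk back from b looking for a, pruning commits whose key is below a's."""
    seen, stack = {b}, [b]
    while stack:
        c = stack.pop()
        if c == a:
            return True
        if key[c] < key[a]:
            continue    # assumed to be "older" than a, so never explored
        for p in parents[c]:
            if p not in seen:
                seen.add(p)
                stack.append(p)
    return False

# Hypothetical history a <- b <- c, where b was committed on a machine
# whose clock was set far in the past.
parents = {"a": [], "b": ["a"], "c": ["b"]}
date = {"a": 100, "b": 50, "c": 120}
generation = {"a": 1, "b": 2, "c": 3}

by_date = is_ancestor("a", "c", parents, date)
by_generation = is_ancestor("a", "c", parents, generation)
```

Here the date cutoff prunes b because it looks older than a, so the walk
never reaches a and reports the wrong answer. The generation cutoff
cannot be fooled that way, since a child's number is always larger than
its parents'.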

And it's just about the only area where I am aware of git doing
something "wrong". Which is why I would like to have had generation
numbers even though the dates do work.

Anyway, to get back to the actual issue of caching vs not caching: if
you think "we could compute it dynamically" means that we should, then
we damn well shouldn't cache it either - why cache it, when you could
just compute it. And if it's worth it to waste resources on the cache
in order to avoid performance issues, then it damn well would be ok to
waste (fewer) resources on just saving the generation number in the
object data base. And make that *fundamental* fix to a hack that git
has had since pretty much day one.
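
(To make the cost comparison concrete, a sketch under my own
assumptions: `Store`, `commit` and `recompute_generation` are
hypothetical names, and the generation is again taken as one more than
the largest parent generation. Saving the number when the commit is
created only needs the parents' numbers; recomputing it later from the
DAG alone means looking at every ancestor.)

```python
class Store:
    """Toy object database: commit id -> (parents, stored generation)."""
    def __init__(self):
        self.commits = {}

    def commit(self, cid, parents=()):
        # Cheap: only the parents' already-stored numbers are consulted.
        gen = 1 + max((self.commits[p][1] for p in parents), default=0)
        self.commits[cid] = (tuple(parents), gen)
        return gen

def recompute_generation(store, cid):
    """Ignore the stored numbers; returns (generation, commits examined)."""
    gen, stack = {}, [cid]
    while stack:
        c = stack[-1]
        ps = store.commits[c][0]
        pending = [p for p in ps if p not in gen]
        if pending:
            stack.extend(pending)    # parents first
        else:
            stack.pop()
            gen[c] = 1 + max((gen[p] for p in ps), default=0)
    return gen[cid], len(gen)

# Hypothetical history: 1000 commits in a line, a side commit forked
# from c500, and a merge of the two.
store = Store()
store.commit("c0")
for i in range(1, 1000):
    store.commit(f"c{i}", [f"c{i - 1}"])
store.commit("s", ["c500"])
stored = store.commit("m", ["c999", "s"])
recomputed, examined = recompute_generation(store, "m")
```

The stored and recomputed values agree, but the recomputation had to
walk the whole history to get there, which is exactly the work the
number is supposed to save.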

And btw, git didn't have the date-based hack originally, because I
didn't think it would be problematic enough. I thought that we could
do universally efficient partial DAG traversal - not having to go all
the way to the root - based purely on the DAG. The code in
"everybody_uninteresting()" tries to be that "limit DAG traversal by
only looking at the DAG itself", and it works for many simple
situations. But it turns out that it does *not* work for many other
cases.

So the generation number really is very very fundamental. It's
absolutely not some "additional information that can be computed",
because the whole AND ONLY point of having the number is to not
compute it.

We are never interested in the generation number for its own sake. We
are only interested in it in order to avoid having to look at the rest
of the DAG.

So no, the number fundamentally isn't computable, because computing it
obviates the need for it.

                           Linus