Re: Git commit generation numbers

Geert Bosch <bosch@xxxxxxxxxxx> · Thu, 14 Jul 2011 22:41:57 -0400

On Jul 14, 2011, at 21:19, Linus Torvalds wrote:
> But dammit, if you start using generation numbers, then they *are*
> required information. The fact that you then hide them in some
> unarchitected random file doesn't change anything! It just makes it
> ugly and random, for chrissake!

Generation numbers will never be required information, because we
can always compute them. These numbers are really much more similar
to other pack index information than anything else.
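
To make "we can always compute them" concrete, here is a minimal
sketch, not git's actual code. It assumes the commit graph is given as
a plain dict from commit name to list of parents, and it assumes one
common definition: a root commit has generation 1, and every other
commit has one more than the largest generation among its parents.
The name generation_numbers and the toy history are made up for
illustration.

```python
def generation_numbers(parents):
    """Return {commit: generation}; roots get 1, others 1 + max over parents.

    `parents` maps each commit to the list of its parent commits.
    Iterative on purpose, so long histories do not hit the recursion limit.
    """
    gen = {}
    for start in parents:
        stack = [start]
        while stack:
            commit = stack[-1]
            if commit in gen:
                stack.pop()
                continue
            pending = [p for p in parents[commit] if p not in gen]
            if pending:
                # Parents must be numbered before the commit itself.
                stack.extend(pending)
            else:
                gen[commit] = 1 + max((gen[p] for p in parents[commit]), default=0)
                stack.pop()
    return gen


# Toy history: A is the root, D merges B and C, E sits on top of D.
history = {"A": [], "B": ["A"], "C": ["A"], "D": ["B", "C"], "E": ["D"]}
print(generation_numbers(history))
```

Nothing here reads anything but the parent links already stored in
the commits, which is the sense in which the numbers are derived data.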

<aside>
Sometimes I wish we'd have general "depth" information for each
SHA1, which would be the maximum number of steps in the DAG to reach
a leaf. This way, if we want to do something like "git log
drivers/net/slip.c", we don't have to bother reading the majority
of trees that have a depth less than two. The depth can also be used
as a limiter for "contains" operations, where we want to see if
commit X contains commit Y: depth(X) has to be at least depth(Y).

However, any such notion, whether generation or depth or whatever
else we'll think of tomorrow, is something particular to a certain
implementation of git. It does not add anything to the information
we store.
</aside>
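
The "contains" limiter from the aside can be sketched as follows.
This is only an illustration under assumptions: the graph is again a
dict from commit to parents, depth is taken to be the longest path
down to a root commit (roots have depth 0), and the helper names
depth_of and contains are hypothetical. The small recursive depth
computation is fine for a toy graph but would need to be iterative for
real histories.

```python
def depth_of(parents):
    """Return {commit: longest number of steps down to a root commit}."""
    depth = {}

    def visit(commit):
        if commit not in depth:
            depth[commit] = 1 + max((visit(p) for p in parents[commit]), default=-1)
        return depth[commit]

    for commit in parents:
        visit(commit)
    return depth


def contains(parents, depth, x, y):
    """True if y is x itself or an ancestor of x, pruning the walk by depth."""
    if depth[x] < depth[y]:
        return False  # x is too shallow to have y in its history
    seen = set()
    stack = [x]
    while stack:
        commit = stack.pop()
        if commit == y:
            return True
        # A commit that is not deeper than y cannot have y as an ancestor.
        if commit in seen or depth[commit] <= depth[y]:
            continue
        seen.add(commit)
        stack.extend(parents[commit])
    return False


# Toy history: F is a side branch off the root that never merges B.
history = {"A": [], "B": ["A"], "C": ["A"], "D": ["B", "C"], "E": ["D"], "F": ["A"]}
depth = depth_of(history)
print(contains(history, depth, "E", "B"))  # True
print(contains(history, depth, "F", "B"))  # False
```

The depth test never changes the answer; it only lets the walk stop
early, which is why it can live outside the commit objects.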

I don't think my commit should have a different SHA1 from yours,
because your tree has more generation numbers than mine.

The beauty and genius of GIT is that it just takes the minimum
amount of data needed to uniquely identify the information to be
stored, and stores that in a UNIQUE format. By allowing generation
numbers to either be present or absent, that's all broken.
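
A small sketch of why an optional field breaks uniqueness. It hashes
an object the way git names it: SHA-1 over "<type> <size>", a NUL
byte, and then the body. The "generation 1" header line is purely
hypothetical, as are the author identity and timestamp; the point is
only that the same tree, author and message end up with two different
names depending on whether the extra line is present.

```python
import hashlib


def git_object_id(kind, body):
    """SHA-1 over '<kind> <size>' + NUL + body, as git names its objects."""
    header = kind.encode() + b" " + str(len(body)).encode() + b"\0"
    return hashlib.sha1(header + body).hexdigest()


# Well-known id of the empty tree, used so the example needs no repository.
EMPTY_TREE = "4b825dc642cb6eb9a060e54bf8d69288fbee4904"

base = (
    f"tree {EMPTY_TREE}\n"
    "author A U Thor <author@example.com> 1310697717 -0400\n"
    "committer A U Thor <author@example.com> 1310697717 -0400\n"
)
message = "\ninitial commit\n"

plain = (base + message).encode()
# Hypothetical in-commit generation header; not an actual git field.
with_generation = (base + "generation 1\n" + message).encode()

print(git_object_id("commit", plain))
print(git_object_id("commit", with_generation))
```

The two printed ids differ even though both commits describe exactly
the same history.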

It's like computing the SHA1 of compressed data: the result depends
not only on the data we store, but also on the particular
representation we choose. Fortunately we have done away with the
first mistake.

So, if you're going to add generation numbers, there has to be a
flag day, after which generation numbers are required everywhere. 
Of course it would be possible to recognize "old style" commits 
and convert them on the fly, but that is true for pretty much 
any format change. However, adding redundant information seems 
like a poor excuse for having a flag day.

Storing generation data in pack indices, on the other hand, makes
perfect sense: when we generate these indices, we do complete
traversals and have all required information trivially at hand. We
can never have that many loose objects, so lack of generation
information there isn't a big deal. By storing generation information
in the index, we can be sure it is consistent with the data contained
in the pack, so there are no cache invalidation issues.

I know I must have missed some stupid and obvious reason why
this is all wrong; I just don't quite see it yet.

  -Geert