Re: Git commit generation numbers

Nicolas Pitre <nico@xxxxxxxxxxx> · Wed, 20 Jul 2011 16:51:40 -0400 (EDT)

On Mon, 18 Jul 2011, George Spelvin wrote:

> Storing the generation number inside the commit means that a commit
> with a generation number has a different hash than a commit without one.
> This means that people won't want to break the hashes of existing commits
> by adding them.  In many cases, ever.
> 
> Which means that git will have to be able to work without the generation
> numbers forever.

I've been diverting myself from $day_job by reading through this thread.  
Still, I couldn't make my mind between having the generation number 
stored in the commit object or in a separate cache by reading all the 
arguments for each until now. Admittedly I'm not as involved in the 
design of Git as I once was, so my comments can be considered with the 
same proportions.

Obviously, with a perfect design, we would have had gen numbers from the 
beginning.  But we did mistakes, and now have to regret and live with 
them (and yes I have my own share of responsibility for some of those 
regrets which are now embodied in the Git data format).

> If the generation numbers are stored in a separate data structure that
> can be added to an existing repository, then a new version of git can
> do that when needed.  Which lets git depend on always having the the
> generation numbers to do all history walking and stop using commit date
> based heuristics completely.

To me this is the killer argument.  Being able to forget about the 
broken date heuristics entirely and simplify the code is what makes the 
external cache so fundamentally better as it can be applied to any 
existing repositories.  And it has no backward compatibility issues as 
old Git version won't work any worse if they can't make any usage of 
that cache.

The alternative of having to sometimes use the generation number, 
sometimes use the possibly broken commit date, makes for much more 
complicated code that has to be maintained forever.  Having a solution 
that starts working only after a certain point in history doesn't look 
eleguant to me at all.  It is not like having different pack formats 
where back and forth conversions can be made for the _entire_ history.

And if you don't care about graft/replace then the cached data is 
immutable just like the in-commit version would, so there is no 
consistency issues.  If you do care about graft/replace (or who knows 
what other dag alteration scheme might be created in 5 years from now) 
then a separate cache will be required _anyway_, regardless of any 
in-commit gen number.

So to say that if a generation number is _really_ needed, then it should 
go in a separate cache.  Saying that if we would have done it initially 
then it would have been inside the commit object is not a good enough 
justification to do it today if it can't be applied to the whole of 
already existing repositories and avoid special cases.

I however have not formed any opinion on that fundamental question i.e. 
whether or not gen numbers are worth it in today's conditions. Neither 
did I think about the actual cache format (I don't think that adding it 
to the pack index is a good idea if grafts are to be honored) which 
certainly has bearing on that fundamental question too.

But I don't see the point of starting to add them now to commit objects, 
even if we regret not doing it initially, simply because having them 
appear randomly based on the Git version/implementation being used is 
still much uglier than some ad hoc cache or even not having them at all.

Nicolas
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html