Re: Git commit generation numbers

Jakub Narebski <jnareb@xxxxxxxxx> · Fri, 15 Jul 2011 02:12:43 -0700 (PDT)

Linus Torvalds <torvalds@xxxxxxxxxxxxxxxxxxxx> writes:
> On Thu, Jul 14, 2011 at 1:31 PM, Jeff King <peff@xxxxxxxx> wrote:
> >
> > However, I'm not 100% convinced leaving generation numbers out was a
> > mistake. The git philosophy seems always to have been to keep the
> > minimal required information in the DAG.
> 
> Yes.
> 
> And until I saw the patches trying to add generation numbers, I didn't
> really try to push adding generation numbers to commits (although it
> actually came up as early as July 2005, so the "let's use generation
> numbers in commits" thing is *really* old).
> 
> In other words, I do agree that we should strive for minimal required
> information.
> 
> But dammit, if you start using generation numbers, then they *are*
> required information. The fact that you then hide them in some
> unarchitected random file doesn't change anything! It just makes it
> ugly and random, for chrissake!
> 
> I really don't understand your logic that says that the cache is
> somehow cleaner. It's a random hack! It's saying "we don't have it in
> the main data structure, so let's add it to some other one instead,
> and now we have a consistency and cache generation problem instead".

Redundant information that is used only to speed up calculations is
exactly what you store in a cache.
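
To make "redundant" concrete, here is a rough sketch (Python, not git
code) of deriving generation numbers from nothing but parent pointers.
The `parents` dict and the function name are made up for illustration,
and I am assuming that a root commit gets generation 1 and every other
commit gets one more than the largest generation among its parents.

```python
def generation_numbers(parents):
    """Return {commit: generation} for a DAG given as {commit: [parent, ...]}.

    Assumed definition: a root commit has generation 1, any other commit
    has 1 + max(generation of its parents).
    """
    gen = {}  # this dict plays the role of the cache
    for start in parents:
        stack = [start]  # explicit stack, so deep history does not recurse
        while stack:
            commit = stack[-1]
            missing = [p for p in parents[commit] if p not in gen]
            if missing:
                # Parents must be numbered before the commit itself.
                stack.extend(missing)
            else:
                gen[commit] = 1 + max((gen[p] for p in parents[commit]), default=0)
                stack.pop()
    return gen


# A diamond: D merges B and C, which both descend from root A.
history = {"A": [], "B": ["A"], "C": ["A"], "D": ["B", "C"]}
numbers = generation_numbers(history)
```

Everything in `numbers` can be thrown away and recomputed from `history`
at any time, which is the property I would expect from a cache.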

[...]
> > Generation numbers are _completely_ redundant with the actual structure
> > of history represented by the parent pointers.

What is more important, the perceived structure of history can be
changed by three mechanisms:

 * grafts
 * replace objects
 * shallow clone

I can understand that you don't want to worry about grafts - they are
a terrible hack.  We can simply turn off the use of generation numbers
stored in commits when grafts are present.

The problem with shallow clones exists only at the beginning, when
some of the commits in a shallow repository do not have generation
numbers.  In such a case you cannot simply calculate the generation
number for a new commit, since the history needed to compute it is
not there.

But what about REPLACE OBJECTS?  If one, for example, uses "git replace"
on a root commit to join a contemporary repository with a historical
repository... this is not addressed in your emails.
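
A toy illustration of the problem (Python, not git code; all names are
hypothetical).  I model a replacement simply as substituting a commit's
parent list, which is a simplification of what "git replace" does, and
I again assume that a root commit has generation 1.

```python
def effective_generations(parents, replace=None):
    """Generation numbers as seen through optional replacements.

    `parents` maps commit -> list of parents as recorded in the objects;
    `replace` maps commit -> substituted parent list (a simplified model
    of a replace ref).  Recursive for brevity, so toy histories only.
    """
    view = {**parents, **(replace or {})}
    gen = {}

    def visit(commit):
        if commit not in gen:
            gen[commit] = 1 + max((visit(p) for p in view[commit]), default=0)
        return gen[commit]

    for commit in view:
        visit(commit)
    return gen


historical = {"h1": [], "h2": ["h1"], "h3": ["h2"]}
contemporary = {"r": [], "x": ["r"], "y": ["x"]}
objects = {**historical, **contemporary}

# What would have been written into the commits when they were created.
stored = effective_generations(objects)
# What history looks like once root "r" is replaced by a commit with parent "h3".
joined = effective_generations(objects, replace={"r": ["h3"]})

stale = {c for c in contemporary if stored[c] != joined[c]}
```

In this model every commit of the contemporary repository ends up in
`stale`: the number embedded in the object no longer matches the history
the user actually sees.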

And let's not forget that we still need a cache for old commits, which
do not yet have a generation number in the commit object.

BTW, your comparison of code size is not fair.

First, some of Peff's code is about _using_ generation numbers, which
will be needed regardless of whether generation numbers are stored in
a cache or packfile index, or embedded in commit objects.

Second, with a generation number commit header you need to write fsck
code, and you have to count the size of this yet-to-be-written code
as well.
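
To illustrate the first point, here is a sketch of one typical use of
generation numbers: cutting short an "is X an ancestor of Y?" walk.  It
is Python rather than git's C, the names are invented, and `gen` is
deliberately just a callable, so the walk looks the same whether the
numbers come from a cache, a pack index, or a commit header.

```python
def is_ancestor(anc, desc, parents, gen):
    """True if `anc` is reachable from `desc` by following parent pointers.

    `parents` maps commit -> list of parents; `gen` is any callable that
    returns a commit's generation number, wherever it is stored.
    """
    limit = gen(anc)
    seen = set()
    stack = [desc]
    while stack:
        commit = stack.pop()
        if commit == anc:
            return True
        # An ancestor always has a strictly smaller generation than its
        # descendants, so `anc` cannot be found below this commit.
        if commit in seen or gen(commit) <= limit:
            continue
        seen.add(commit)
        stack.extend(parents[commit])
    return False


history = {"A": [], "B": ["A"], "C": ["A"], "D": ["B", "C"]}
numbers = {"A": 1, "B": 2, "C": 2, "D": 3}

a_in_d = is_ancestor("A", "D", history, numbers.get)
b_in_c = is_ancestor("B", "C", history, numbers.get)  # pruned immediately
```

Code of this kind has to be written for either storage choice, so it
should not be counted against the cache approach.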

[...]
> > I liken it somewhat to the "don't store renames" debate.
> 
> That's total and utter bullshit.

I think Peff meant here that if you make a mistake in calculating
rename info or a generation number, and the incorrect information ends
up stored in a commit object, you are f**ked.

> Storing renames is *wrong*. I've explained a million times why it's
> wrong. Doing it is a disaster. I know. I've used systems that did it.
> It's crap. It's fundamentally information that is actively misleading
> and WRONG. It's not even that you can do rename detection at run-time,
> it's that you *HAVE* to do rename detection at run-time, because doing
> it at commit time is simply utterly and fundamentally *wrong*.
> 
> Just look at "git blame -C" to remind yourself why rename information is wrong.

Also, doing full code movement and copy detection (which is what "git
blame -C" does), rather than simplistic whole-file rename detection, is
pretty much impossible at commit time.

NB: most SCMs that use path-id based rename tracking require that the
user explicitly mark renames using "scm move" or "scm rename" (well,
Mercurial has a tool for rename detection before commit, "hg
addremove").  But asking the user to mark code movements is simply
infeasible.

> But even more importantly, look at git merges. Look at how git has
> gotten merging right since pretty much day #1, and has absolutely no
> issues with files that got generated two different ways. Look at every
> SCM that tries to do rename detection, and look at how THEY CANNOT DO
> MERGES RIGHT.
> 
> It's that simple. Rename detection is not about avoiding "redundant
> data". It's about doing the right thing.

Well, rename tracking supporters say that heuristic rename detection
can be wrong.

By the way, what happened to the "wholesale directory rename detection"
patches?  Without them, in the situation where one side renamed a
directory and the other side created a new file in said directory, git
on merge creates the file under the re-created old name of the
directory...

-- 
Jakub Narebski
Poland