Re: Git commit generation numbers

Jeff King <peff@xxxxxxxx> · Fri, 15 Jul 2011 03:46:56 -0400

On Thu, Jul 14, 2011 at 06:19:30PM -0700, Linus Torvalds wrote:

> Yes.
> 
> And until I saw the patches trying to add generation numbers, I didn't
> really try to push adding generation numbers to commits (although it
> actually came up as early as July 2005, so the "let's use generation
> numbers in commits" thing is *really* old).
> 
> In other words, I do agree that we should strive for minimal required
> information.
> 
> But dammit, if you start using generation numbers, then they *are*
> required information. The fact that you then hide them in some
> unarchitected random file doesn't change anything! It just makes it
> ugly and random, for chrissake!

So you don't see a difference between storing the information directly
in the commit object, where it affects the sha1 of the commit, and
calculating and storing it somewhere else? That is what seems ungit to
me. You aren't adding new information to the DAG (note I said "DAG" and
not commit) that is not already there, but you are changing the ids of
commits in the DAG.

I'm not saying that's a reason to ultimately reject the idea of putting
generation numbers in commit objects. But it is a reason to give us
pause and figure out if there are other solutions, because it will be
the first time such redundant information has been added. And that's
what I've been trying to do during this discussion with you: work out
what the options are and evaluate them.

> I really don't understand your logic that says that the cache is
> somehow cleaner. It's a random hack! It's saying "we don't have it in
> the main data structure, so let's add it to some other one instead,
> and now we have a consistency and cache generation problem instead".

Are packfiles unclean, or a random hack? How about pack indices? What
about Nico's and Shawn's ideas for a packv4 that would gain efficiency
by storing objects not in their whole format, but in a way that would
make tree examination faster (but would be able to restore the whole
objects byte for byte)?

Those things rely on the idea that the git DAG is a data model that we
present to the user, but that we're allowed to do things behind the
scenes to make things faster. We're allowed to make an index of offsets
of objects in the packfile for faster lookup. Why are we not allowed to
use an index for other object data if it will speed up our local
algorithms?

Again, I'm not saying that the patches I posted are necessarily the
answer. Maybe my cache implementation sucks. Maybe the value should go
into a pack index instead. Maybe the whole idea is stupid. But I don't
think it's worth rejecting out-of-hand the idea that the generation
number might be stored outside of the commit object. I do think it's
worth talking about what the actual downsides are, as compared to other
options.

For example, you mentioned there a consistency problem in the paragraph
above. What is it?

If you mean the problem with refs/replace, then yes, that is an open
problem to be solved (though not a hard one, as I already mentioned a
solution elsewhere). But is that problem better or worse with this
solution versus an embedded generation number? It seems to me that an
embedded generation number is even worse.

> Just look at the size of the patches in question. Your caching patches
> are bigger and more complicated. Sure, part of it is that your series
> adds the code to _use_ the generation number, but look purely at the
> code to maintain them.

It's 300 lines of code. That can also be used to store arbitrary
meta-information for commits. I've already achieved significant speedups
in some workflows by caching patch-id calculations, which would reuse
this code.  And I certainly don't think _that_ should go into the commit
object.

Yes, it's more complex than simply adding a generation number to the
commit header. But simply adding a generation number does not actually
give the 100-fold speedup I'm seeing. So again, I'm not interested in
rejecting solutions out of hand; I'm interested in things like: is the
complexity of the cache worth this speedup? What other options do we
have, and what speedup do they provide? Do we care enough about this
speedup to even bother?

> Why do you think the odd separate cache is somehow better than just
> doing it right? Seriously? If we require the generation numbers, then
> they have *become* that minimal information that we should save!

What do you do when generation numbers don't match the DAG represented
by the parent pointers? Are you proposing to just ignore it? I'm not
asking that question adversarially; ignoring may be the sane thing to
do, and we say "generation numbers are to be trusted, even if they don't
match parent pointers".

> Now, six years later, you clearly are saying that we need the
> generation numbers, but then you go off and try to say that they
> should be in some secondary non-architected random collection of data
> structures that isn't covered by the security and maintenance
> guarantees that the core git objects are.

I don't think I said we clearly need them. I said we can get speedups by
using them, and I showed some patches. I _also_ posted patches showing
how to accomplish similar speedups using timestamps. Note that all of my
patches started with "RFC". I am trying to figure out which is the best
way to proceed.

And why _would_ they need to be covered by the security and maintenance
guarantees of core objects? You can trivially calculate them from the
core objects. Are pack indices also a "secondary non-architected random
collection of data structures"?

> Dammit, one of the things that makes git special is that the data
> structures are NOT random odd ad-hoc files. There is a design to them.

There is just as much documentation and design for the new file format I
added as there is for pack indices (in fact, they're quite similar in
design). I really see them at the same level: something we calculate to
speed up some algorithms, but something we could regenerate at any time
if we felt like.

> > And so that seems a bit hack-ish to me.
> 
> Um? If you feel that way, then why the hell are you pushing your EVEN
> MORE HACKISH CACHE PATCHES?

Please, there is really no need to shout. And I find it quite silly that
you would refer to me as "pushing" these patches when they have been
clearly listed as RFC, and everything I have posted in the nearby
threads has been about comparing different strategies (with patches and
timings for some of those other strategies!).

> We don't do caches in git, because we don't NEED to. Sure, gitk has
> it's hacky cache, but that's not core functionality.

I'm sorry to tell you that there is already a cache for external
conversion of diffs for blobs. And that I have a patch series which
makes "git cherry" much more pleasant to use by caching patch ids.

Do we "need" those? No, of course not. Git works just fine without them,
albeit a bit slower. But is it sometimes worth making a space-time
tradeoff to make some algorithms faster? I think it sometimes is,
depending on the space and time factors, and the complexity of the
storage (e.g., consistency problems with caching).

> I think it's a sign of good design that we can do a "find .git" and
> explain every single file, and show that it's all core functionality
> (again, with the exception of "gitk.cache", and I suspect that's
> because gitk is a script, not because of any really fundamental data
> issues), and explain it.

Would it make you happier if we stored the generation data in the pack
index when we index the packs?

> I think the *cache* is a hell of a lot more hacky than just doing it right.

You still haven't explained how we would "do it right" and get the same
speedups. When I responded to your initial email, your answers were
along the lines of "we could cache fewer things". If your position is to
damn the speedup, the cache is not worth the complexity, I can buy that.
If your position is that the complexity is not worth it, and we are
better off to keep using timestamps, I can buy that. If your position is
that you can find a clever way, using only the generation numbers in
newly created commits, to get similar speedups in "git {tag,branch}
--contains", I'd love to hear it.

> > I liken it somewhat to the "don't store renames" debate.
> 
> That's total and utter bullshit.
> 
> Storing renames is *wrong*. I've explained a million times why it's
> wrong. Doing it is a disaster. I know. I've used systems that did it.
> It's crap. It's fundamentally information that is actively misleading
> and WRONG. It's not even that you can do rename detection at run-time,
> it's that you *HAVE* to do rename detection at run-time, because doing
> it at commit time is simply utterly and fundamentally *wrong*.

Yes, I am well aware that stored renames are wrong for merging. The
problem is that they not a function of a tree state (which is what a
commit stores), but rather of the difference between two states. So when
you diff the commit's state with some other arbitrary merge-base, any
renames recorded at commit time would be worthless.

But consider another case. Each time I run "git log -M --raw", I compute
the same renames over and over. Let's say I have a case in which this is
annoyingly slow, and want to speed it up. The state of a particular
commit and the state of its parents are invariants for a particular sha1
commit id; this is a fundamental property of git, as you well know. So
for a given rename-detection algorithm (and any parameters it has), the
set of renames between the states will also be an invariant.

Now imagine I create a persistent cache mapping the commit sha1 for some
sane default set of rename algorithm parameters to a set of rename
pairs. My annoyingly slow "log -M" is now faster, and I'm happier.

I think you encounter a similar set of questions here as you do with the
concept of a generation header. If the information is an invariant for a
particular commit sha1, can we and should we store it in the commit
object? Is the speedup worth the complexity of a cache? What are the
circumstances under which the cache is not applicable, and how often do
they come up? Can we accurately detect when the cache is not applicable?

And that is why I compared it to the idea of storing renames. Please
note that I did _not_ say they were exactly the same situation, or that
the answers to one set of questions were the same as the answers to
another.

-Peff
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html