Re: Git commit generation numbers

"George Spelvin" <linux@xxxxxxxxxxx> · 21 Jul 2011 16:27:22 -0400

> There is also another issue that I have mentioned, namely incomplete
> clones - which currently means shallow clone, without access to full
> history.

As far as history walking is concerned, you can just consider "missing
parent" the same as "no parent" and start the generation numbers at 0.
As long as you recompute
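To make the "missing parent is the same as no parent" rule concrete, here is a minimal sketch. It assumes the locally available commits are given as a dict mapping each commit to its list of parent ids. A parent id that is not a key in the dict is treated as missing. The function name and the data layout are my own invention for illustration, not anything in git.

```python
def generation_numbers(parents):
    """Return {commit: generation} for every commit in `parents`.

    `parents` maps commit id -> list of parent ids.  A parent that is not
    itself a key (e.g. cut off by a shallow clone) is ignored, so a commit
    whose parents are all missing starts at generation 0.
    """
    gen = {}
    for start in parents:
        stack = [start]          # explicit stack: histories can be very deep
        while stack:
            c = stack[-1]
            if c in gen:
                stack.pop()
                continue
            known = [p for p in parents[c] if p in parents]
            pending = [p for p in known if p not in gen]
            if pending:
                stack.extend(pending)   # number the parents first
            else:
                # 1 + highest parent generation; 0 if no known parents
                gen[c] = 1 + max((gen[p] for p in known), default=-1)
                stack.pop()
    return gen
```

With a shallow history such as `{"C": ["B"], "B": ["A"]}`, where "A" is not present, "B" gets generation 0 and "C" gets generation 1.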

> Nb. grafts are such a horrible hack that I would not be against turning
> off generation numbers if they are used.

Yeah, but it's not too miserable to add support (the logic is very similar
to replace objects), and then you would be able to have the history walking
code depend on the presence of generation numbers.  (The "load the cache"
function would regenerate it if necessary.)

Only do this if you already have support for "no generation numbers" in
the history walking code for (say) loose objects.

> In the case of replace objects you need both non-replaced and replaced
> DAG generation numbers.

Yes, the cache validity/invalidation criteria are the tricky bit.
Honestly, this is where the code gets ugly, not computing and storing
the generation numbers.

One thought on an expanded generation number cache:

There are many git operations that use ONLY the commit DAG, and do not
actually use any information from the commits other than their hashes
and parent pointers.  The ones that come to mind are rev-parse, rev-list,
describe, name-rev, and merge-base.

These could be sped up if, instead of just generation numbers, we kept
a complete cached copy of the commit DAG, so the commit objects didn't
have to be uncompressed and parsed.

This could be provided by an extended form of generation number cache.
In addition to listing the generation number of each commit, it
would list each commit's parents (by file offset rather than hash, for
compactness).  Then simple commit walking could load this cache and
avoid unpacking commit objects from packs.
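As a rough illustration of a walk that needs nothing but the cached DAG, here is a sketch of an ancestry test. It assumes the cache has already been loaded into two dicts, `parents` and `gen`. Both names are hypothetical, and the history is assumed complete. The only property it relies on is that a parent always has a smaller generation number than its child.

```python
def is_ancestor(parents, gen, anc, desc):
    """True if `anc` is reachable from `desc` by following parent pointers.

    Only ids, parent lists and generation numbers are consulted; no commit
    object is ever read.
    """
    seen = set()
    stack = [desc]
    while stack:
        c = stack.pop()
        if c == anc:
            return True
        # Everything reachable from c has a smaller generation than c, so
        # once we are at or below anc's generation, anc cannot be behind c.
        if c in seen or gen[c] <= gen[anc]:
            continue
        seen.add(c)
        stack.extend(parents[c])
    return False
```

The generation cutoff is what keeps the walk from wandering all the way down to the root when the answer is "no".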

A compact implementation would abuse the flexibility of generation numbers
to make them serve double duty.  They would be used as offsets into a
table of parent pointers.  By keeping the table topologically sorted,
the offsets would satisfy the requirements for generation numbers, but
would be unique, and there would be additional gaps when a commit had
multiple parents.

The parent pointers would themselves be 31-bit offsets into the table of
SHA-1 hashes, with the msbit meaning "this commit has multiple parents,
also look at the following table entry".  (If we use offset 0 to mean
"no parents", it might be more convenient to have the offset point to
the *end* of the run of parents rather than the beginning, so "following"
would be earlier in the file, but that's an implementation detail.)
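Here is a small sketch of the layout described above, under some simplifying assumptions of mine: the offset points to the beginning of a run (not the end), slot 0 of the parent table is reserved to mean "no parents", and a sorted list of ids stands in for the table of SHA-1 hashes. The function names are made up. It uses `graphlib`, so it needs Python 3.9 or later.

```python
from graphlib import TopologicalSorter

MORE = 0x80000000   # msbit: another parent follows in the next entry
MASK = 0x7FFFFFFF   # low 31 bits: index into the hash table

def build_tables(parents):
    """parents: commit id -> list of parent ids (complete DAG assumed)."""
    hashes = sorted(parents)                       # stand-in for SHA-1 table
    index = {h: i for i, h in enumerate(hashes)}
    table = [0]                                    # slot 0 = "no parents"
    offset = [0] * len(hashes)                     # doubles as generation no.
    # static_order() yields every parent before any of its children.
    for h in TopologicalSorter(parents).static_order():
        ps = parents[h]
        if not ps:
            continue                               # root: offset stays 0
        offset[index[h]] = len(table)
        for k, p in enumerate(ps):
            last = (k == len(ps) - 1)
            table.append(index[p] if last else index[p] | MORE)
    return hashes, offset, table

def parents_of(i, offset, table):
    """Decode the run of parent entries for the commit at hash index i."""
    pos = offset[i]
    out = []
    while pos:                                     # 0 means no parents
        entry = table[pos]
        out.append(entry & MASK)
        pos = pos + 1 if entry & MORE else 0
    return out
```

Because the table is filled in topological order, a child's offset is always larger than any of its parents' offsets, which is all a generation number has to guarantee. A merge with two parents occupies two slots, which is where the extra gaps come from.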

I'm assuming that 2^31 commits having (in aggregate) 2^32 parents would
be enough for the time being.  As a local cache, it can be extended
with a software upgrade.  There's no need to ever have support for two
formats in any given release; just notice that the cache format is wrong,
blow it away, and regenerate it.
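The "notice, blow away, regenerate" policy might look something like the sketch below. The magic string, the version number, the header layout and the `regenerate` callback are all hypothetical; the point is only that a mismatch is handled by rebuilding, never by parsing an old format.

```python
import struct

MAGIC = b"GENC"                  # hypothetical cache signature
VERSION = 1                      # bump whenever the layout changes
HEADER = struct.Struct(">4sI")   # magic + big-endian 32-bit version

def load_cache(path, regenerate):
    """Return the cache payload, rebuilding the file if it is unusable."""
    try:
        with open(path, "rb") as f:
            data = f.read()
        magic, version = HEADER.unpack_from(data)
        if magic == MAGIC and version == VERSION:
            return data[HEADER.size:]
    except (FileNotFoundError, struct.error):
        pass                     # missing or truncated: fall through
    # Wrong or absent format: discard and rebuild from the object store.
    payload = regenerate()
    with open(path, "wb") as f:
        f.write(HEADER.pack(MAGIC, VERSION) + payload)
    return payload
```

Since the file is purely a local cache, nothing is lost by throwing it away, and the release only ever needs to understand one format.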