Re: topological index field for commit objects

Marc Strapetz <marc.strapetz@xxxxxxxxxxx> · Thu, 30 Jun 2016 00:15:36 +0200

On 29.06.2016 22:39, Junio C Hamano wrote:
Stefan Beller <sbeller@xxxxxxxxxx> writes:

On Wed, Jun 29, 2016 at 11:59 AM, Junio C Hamano <gitster@xxxxxxxxx> wrote:
On Wed, Jun 29, 2016 at 11:31 AM, Marc Strapetz
<marc.strapetz@xxxxxxxxxxx> wrote:
This is no RFE but rather recurring thoughts whenever I'm working with
commit graphs: a topological index attribute for commit objects would be
incredibly useful. By "topological index" I mean a simple integer for which
the following condition holds true:

Look for "generation numbers" in the list archive, perhaps?

Thanks for the pointer to the interesting discussions.

In http://www.spinics.net/lists/git/msg161363.html
Linus wrote in a discussion with Jeff:

Right now, we do *have* a "generation number". It's just that it's
very easy to corrupt even by mistake. It's called "committer date". We
could improve on it.

Would it make sense to refuse creating a commit whose commit date is
prior to its parent's commit date (except when the user gives a
`--dammit-I-know-I-break-a-wildy-used-heuristic`)?

I think that has also been discussed in the past.  I do not think it
would help very much in practice, as projects already have up to 10
years (and the ones migrated from CVS, even more) worth of commits
they cannot rewrite that may record incorrect committer dates.
You'd need something like "you can trust committer dates that are
newer than this date" per project to switch between slow path and
fast path, with an updated fsck that knows how to compute that
number after you pulled from somebody who used that overriding
option.

If the use of generation number can somehow be limited narrowly, we
may be able to incrementally introduce it only for new commits, but
I haven't thought things through, so let me do so aloud here ;-)

Suppose we use it only for this purpose:

 * When we have two commits, C1 and C2, with generation numbers G1
   and G2, we can say "C1 cannot possibly be an ancestor of C2" if
   G1 > G2.  We cannot say anything else based on generation
   numbers (or lack thereof).

then I think we could just say "A newly created commit must record
generation number G that is larger than generation numbers of its
parent commits; ignore parents that lack generation number for the
purpose of this sentence".
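
To make the narrow use described above concrete, here is a minimal sketch
in Python. It is not existing Git code; the function name is hypothetical,
and a missing generation number is modelled as None. The only conclusion it
ever draws is "cannot be an ancestor"; in every other case the caller would
have to fall back to the current commit-walking code.

```python
from typing import Optional


def cannot_be_ancestor(g1: Optional[int], g2: Optional[int]) -> bool:
    """Return True only if generation numbers prove that C1 is not an
    ancestor of C2, where g1 and g2 are their generation numbers.

    None stands for "no generation number recorded" (an assumption of
    this sketch). A False result means "no conclusion", not "C1 is an
    ancestor of C2".
    """
    if g1 is None or g2 is None:
        # Lacking a number on either side, nothing can be said.
        return False
    # A larger number on C1 rules it out as an ancestor of C2.
    return g1 > g2
```

For example, cannot_be_ancestor(7, 3) is True, while both
cannot_be_ancestor(3, 7) and cannot_be_ancestor(None, 3) are False and
leave the question open.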

From an algorithmic perspective, for already existing repositories you
would still have to switch from optimized generation number code to the
current commit-time based code. That could make things even more complex,
and it's possibly expensive to determine whether a repository has full
generation number support or not.

On the other hand, for new repositories, you could immediately use 
generation number based algorithms. So it could be "A newly created 
commit must record generation number G that is larger than generation 
numbers of its parent commits if all parent commits have a generation
number recorded; otherwise do not record a generation number". Something 
like "git filter-branch" might already be sufficient to convert 
repositories.
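
A minimal sketch of that rule in Python, again not existing Git code. The
function name is hypothetical, None stands for "no generation number
recorded", and picking "largest parent number plus one" (with 1 for a root
commit) is just one possible way to satisfy "larger than its parents":

```python
from typing import Optional, Sequence


def generation_for_new_commit(
    parent_gens: Sequence[Optional[int]],
) -> Optional[int]:
    """Generation number to record for a new commit, or None for none.

    parent_gens holds the generation numbers of the parent commits,
    with None for a parent that has no number recorded.
    """
    # If any parent lacks a generation number, do not record one.
    if any(g is None for g in parent_gens):
        return None
    # Assumption of this sketch: a root commit (no parents) gets 1,
    # any other commit gets one more than its largest parent number.
    return max(parent_gens, default=0) + 1
```

With this rule a commit only carries a number if every one of its
ancestors does, so generation_for_new_commit([2, 4]) yields 5, while
generation_for_new_commit([2, None]) yields None.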

Git versions released in 2019 may start issuing warnings if HEAD has no 
generation number assigned and Git versions released in 2025 may 
completely refuse to work with such repositories.

In the interim period, a local cache, as Jeff is proposing, could serve as
a secondary source for generation numbers. This would allow phasing out
the current algorithms immediately.

-Marc