Re: [RFC][GSoC] Implement Generation Number v2

Jakub Narebski <jnareb@xxxxxxxxx> · Mon, 23 Mar 2020 14:43:37 +0100

Junio C Hamano <gitster@xxxxxxxxx> writes:
> Abhishek Kumar <abhishekkumar8222@xxxxxxxxx> writes:
>> Jakub Narębski <jnareb@xxxxxxxxx> writes:
[...]
>>> Unfortunately for the time being we cannot use commit-graph format
>>> version; the idea that was proposed on the mailing list (when we found
>>> about the bug in handling commit-graph versioning, during incremental
>>> commit-graph implementation), was to create and use metadata chunk or
>>> versioning chunk (the final version of incremental format do not use
>>> this mechanism).  This could be used by gen2 compatibile Git to
>>> distinguish between situation where old commit-graph file to be updated
>>> uses generation number v1, and when it uses v2.
>>> 
>>> If you have a better idea, please say so.
>>
>> We could also use a flag file. Here's how it works:
>>
>> If the file `.git/info/generation-number-v2` exists, use gen2.
>> Otherwise use gen1.
>
> If the file is lost then we will try to read the other file that has
> the commit-graph data as if it were in old format?  And if such a
> file was created (say, with "touch .git/info/generation-number-v2"),
> a file in the original format will be read as if it is in new
> format?  If that is the case, it is likely that we'd see a segfault;
> sounds too brittle to me.
>
> It appears that the format of "CDAT", and the fact that generation
> is represented as higher 30-bit of a be32 integer, is very much
> hardcoded in the design and is hard to change, but your new version
> of graph file can be designed not to use "CDAT" chunk at all, and
> instead have the commit data with new version of generation numbers
> stored in a different chunk (say "CDA2") to force older version of
> Git not to use the new graph file---would that work?

It looks like there are a few possible ways of handling the introduction
of generation numbers v2.  Let's consider them one by one.

The problem we need to solve is the co-existence of old Git (which does
not understand v2, and which fails hard on a commit-graph format version
bump) and new Git (which understands and writes v2, and which I assume
fails softly, that is, it simply doesn't use the commit-graph if it is
of an unknown version).

If the commit-graph file was written by new Git and includes generation
numbers v2, we want old Git at the very least not to crash; acceptable
would be for it not to use the commit-graph at all; best would be if it
could use the commit-graph in a suboptimal way.  We also need to handle
old Git trying to update (in an incremental or non-incremental way) the
commit-graph file.

If the commit-graph file was written by old Git and includes generation
numbers v1 (topological levels), we want new Git to recognize this and,
at best, use those old generation numbers in a correct way.  We want new
Git to be able to update the commit-graph file (in an incremental or
non-incremental way).

Did I miss anything?

Proposed solutions are:
 - metadata / versioning chunk,
 - flag file: `.git/info/generation-number-v2`,
 - new chunk for commit data: "CDA2".

I would like to propose yet another solution: putting generation number
v2 data in a separate chunk (and possibly keeping generation number v1
in the CDAT commit data chunk).  In this case we could even use the
ordinary corrected commit date as generation number v2 (storing offsets
as 32-bit unsigned values), instead of the backward-compatible corrected
commit date with monotonic offsets.
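
To make "ordinary corrected commit date" concrete, here is a minimal
sketch in Python (not Git's actual C code).  It assumes the usual
definition: a commit's corrected date is the larger of its own commit
date and one more than the largest corrected date among its parents; the
offset is the difference between the corrected date and the commit date.
The input shape (a dict of id -> (date, parents)) and the function names
are made up for illustration.

```python
def corrected_commit_dates(commits):
    """commits: dict mapping id -> (commit_date, [parent ids]).

    Returns a dict mapping id -> corrected commit date.
    Uses an explicit stack so deep histories do not hit the recursion limit.
    """
    corrected = {}
    for start in commits:
        stack = [start]
        while stack:
            cid = stack[-1]
            if cid in corrected:
                stack.pop()
                continue
            date, parents = commits[cid]
            pending = [p for p in parents if p not in corrected]
            if pending:
                # Parents must be computed before the commit itself.
                stack.extend(pending)
                continue
            # One more than the largest corrected date among parents
            # (0 for a root commit, so its own date wins).
            floor = max((corrected[p] + 1 for p in parents), default=0)
            corrected[cid] = max(date, floor)
            stack.pop()
    return corrected


def date_offsets(commits, corrected):
    """Offset to store per commit: corrected date minus commit date."""
    return {cid: corrected[cid] - commits[cid][0] for cid in commits}
```

For a history where "b" has an earlier commit date than its parent "a"
(clock skew), only "b" gets a non-zero offset.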

Each solution has its advantages and disadvantages.

With the flag file, the problem is (as Junio noticed) that if the file
gets accidentally deleted, new Git would incorrectly think that the
commit-graph uses generation number v1... which, while suboptimal,
should not be bad thanks to backward compatibility.  But I think the
flag file should have some kind of checksum as its contents (perhaps
simply a copy of the commit-graph file checksum, or one checksum per
file in the chain with incremental commit-graph), so that if old Git
rewrites the commit-graph file leaving the flag file present, new Git
would notice this.
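
A rough sketch of that check, in Python rather than C, and only for the
single-file (non-incremental) case.  It assumes the flag file holds the
hex form of the commit-graph trailer checksum, and that the trailer is
the last 20 bytes of the file (SHA-1); the function names are
hypothetical.

```python
HASH_LEN = 20  # assumption: SHA-1 trailer at the end of the file


def trailer_checksum(graph_bytes):
    """Hex form of the checksum stored at the end of the commit-graph."""
    return graph_bytes[-HASH_LEN:].hex()


def generation_version(graph_bytes, flag_contents):
    """Decide which generation number version to trust.

    flag_contents is the text of the flag file, or None if it is missing.
    A missing or stale flag falls back to v1.
    """
    if flag_contents is None:
        return 1
    if flag_contents.strip() == trailer_checksum(graph_bytes):
        return 2
    # Flag exists but the graph was rewritten by someone else
    # (e.g. old Git), so the flag no longer describes this file.
    return 1
```

This way a flag created with "touch" (empty contents) or left behind
after old Git rewrote the file would not be trusted.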

A metadata or versioning chunk cannot be deleted by mistake; but if old
Git copies unknown chunks to the new updated commit-graph file instead
of skipping them, we would need to add some kind of checksum (similarly
to the case of the flag file).  The problem to be solved is what to do
if some files in the chain of commit-graph files have v2 (and this
chunk), and some have v1 generation numbers (and do not have this
chunk).

About moving commit data with generation number v2 to a "CDA2" chunk: if
the "CDAT" chunk is missing then (I think) old Git would simply not use
the commit-graph file at all; it may crash, but I don't think so.  If
the "CDAT" chunk has zero length... I don't know what would happen then;
possibly old Git would also simply not use the commit-graph data at all.

Putting generation number v2 into a separate chunk (which might be
called "GEN2" or "OFFS"/"DOFF") has the disadvantage of increasing the
on-disk size of the commit-graph, and possibly also increasing memory
consumption (the latter depends on how it would be handled), but has the
advantage of being fully backward compatible.  Old Git would simply use
generation numbers v1 in "CDAT", new Git would use generation numbers v2
(combining the commit creation date from "CDAT" and the offset from
"OFFS"), and there should be no problems with updating the commit-graph
file (either rewriting it, or adding a new commit-graph to the chain).
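
To illustrate how a reader might combine the two chunks, here is a small
Python sketch.  It assumes the last 8 bytes of a "CDAT" entry are two
big-endian 32-bit words holding a 30-bit generation number followed by a
34-bit commit date, and that the hypothetical "OFFS" chunk is a plain
array of 32-bit unsigned offsets indexed by commit position.  The
function names are invented for the example.

```python
import struct


def decode_cdat_tail(eight_bytes):
    """Split the last 8 bytes of a CDAT entry into (generation v1, date)."""
    hi, lo = struct.unpack(">II", eight_bytes)
    topo_level = hi >> 2                     # upper 30 bits of the first word
    commit_date = ((hi & 0x3) << 32) | lo    # remaining 34 bits
    return topo_level, commit_date


def generation(cdat_tail, offs_chunk, pos):
    """Return ("v2", corrected date) if OFFS is present, else ("v1", level).

    offs_chunk is the raw bytes of the hypothetical OFFS chunk, or None
    when the file was written without it (e.g. by old Git).
    """
    topo_level, commit_date = decode_cdat_tail(cdat_tail)
    if offs_chunk is None:
        return ("v1", topo_level)
    (offset,) = struct.unpack_from(">I", offs_chunk, 4 * pos)
    return ("v2", commit_date + offset)
```

Old Git never looks at "OFFS", so it keeps reading the topological level
from "CDAT" unchanged.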

I think that's all.

Best,
-- 
Jakub Narębski