On Wed, May 01 2019, Derrick Stolee via GitGitGadget wrote: > The commit-graph file format has some shortcomings that were discussed > on-list: > > 1. It doesn't use the 4-byte format ID from the_hash_algo. > > 2. There is no way to change the reachability index from generation numbers > to corrected commit date [1]. > > 3. The unused byte in the format could be used to signal the file is > incremental, but current clients ignore the value even if it is > non-zero. > > This series adds a new version (2) to the commit-graph file. The fifth byte > already specified the file format, so existing clients will gracefully > respond to files with a different version number. The only real change now > is that the header takes 12 bytes instead of 8, due to using the 4-byte > format ID for the hash algorithm. > > The new bytes reserved for the reachability index version and incremental > file formats are now expected to be equal to the defaults. When we update > these values to be flexible in the future, if a client understands > commit-graph v2 but not those new values, then it will fail gracefully. > > NOTE: this series was rebased onto ab/commit-graph-fixes, as the conflicts > were significant and subtle. > > Updates in V3: Thanks for all the feedback so far! > > * Moved the version information into an unsigned char parameter, instead of > a flag. > > * We no longer default to the v2 file format, as that will break users who > downgrade. This required some changes to the test script. > > * Removed the "Future work" section from the commit-graph design document > in a new patch. > > * I did not change the file name for v2 file formats, as Ævar suggested. > I'd like the discussion to continue on this topic. I won't repeat my outstanding v2 feedback about v1 & v2 incompatibilities, except to say that I'd in principle be fine with having a v2 format the way this series is adding it. I.e. saying "here it is, it's never written by default, we'll figure out these compat issues later". My only objection/nit on that point would be that the current docs/commit messages should make some mention of the really bad interactions between v1 and v2 on different git versions. However, having written this again I really don't understand why we need a v2 of this format at all. The current format is: <CGPH signature> <CGPH version = 1> <hash version (0..255) where 1 == SHA-1> <num chunks (0..255)> <reserved byte ignored> [chunk offsets for our $num_chunks] [arbitrary chunk data for our $num_chunks] And you want to change it to: <CGPH signature> <CGPH version = 2> <num chunks (0..255)> <reachability index version, hard error on values != 1 (should have seen this in my [1])> <reserved byte hard error on values != 0 [1]> <hash version 32 bit. So 0x73686131 = "sha1" instead of "1"> [chunk offsets for our $num_chunks] [arbitrary chunk data for our $num_chunks] Where "chunks" in the v1 format has always been a non-exhaustive list of things *where we ignore anything we don't know about*. So given our really bad compatibility issues with any non-v1 format I suggested "let's use a different filename". But on closer look I retract that. How about we instead just don't change the header? I.e.: * Let's just live with "1" as the marker for SHA-1. Yeah it would be cute to use 0x73686131 instead like "struct git_hash_algo", but we can live with a 1=0x73686131 ("sha1"), 2=0x73323536 ("s256") mapping somewhere. It's not like we're going to be running into the 255 limit of hash algorithms Git will support any time soon. * Don't add the reachability index version *to the header* or change the reserved byte to be an error (see [1] again). Instead we just add these things to new "chunks" as appropriate. As this patch of mine shows we can easily do that, and it doesn't error out on any existing version of git: https://github.com/avar/git/commit/3fca63e12a9d38867d4bc0a8a25d419c00a09d95 I now can't imagine a situation where we'd ever need to change the format. We have 32 bits of chunk ids to play with, and can have 255 of these chunks at a time, and unknown chunks are ignored by existing versions and future version. We can even have more than 255 if it comes to that by having a special "extension" chunk, or even use the existing reserved byte for that and pull the nasty trick of putting another set after the existing file checksum, but I digress. If we ever find that we e.g. don't want to write SHA-1 data anymore but just want SHA-256 we just write a tiny amount of dummy data. Older git versions will shrug at what looks like a really incomplete commit graph data, but newer versions will know the real data is in some other chunk they know about. Ditto this "gen numbers or adjusted timestamps?" plan in https://public-inbox.org/git/6367e30a-1b3a-4fe9-611b-d931f51effef@xxxxxxxxx/ We can have a chunk of adjusted timestamps into the generation number chunk, and even start adding chunks of other side-data, e.g. the path bloom filters... This E-Mail needs to stop at some point, but as a brief aside I don't see how this die("commit-graph hash algorithm does not match current algorithm") plan makes sense either. The hash-function-transition.txt plan describes how we'll have an index of SHA-1<->SHA-256 object names. Why would it be an error to read a SHA-1 commit-graph under SHA-256? Won't we just say "oh this graph lists the SHA-1s" and then use that lookup table to resolve them as SHA-256s? And as discussed once we go for extending things with chunks we can do a lot better. We'd just keep the SHA-1 data segment, and perhaps (for "just commits" cache locality) have another chunk of SHA-256 mappings to those SHA-1s in the commit-graph file. 1. See feedback on the v2 patch in https://public-inbox.org/git/87muk6q98k.fsf@xxxxxxxxxxxxxxxxxxx/