On Fri, 7 Aug 2020 at 01:08, brian m. carlson <sandals@xxxxxxxxxxxxxxxxxxxx> wrote:
>
> On 2020-08-06 at 20:23:58, Martin Ågren wrote:
> > After eff45daab8 ("repository: enable SHA-256 support by default",
> > 2020-07-29), vanilla builds of Git enable the user to run, e.g.,
> >
> >   git init --object-format=sha256
> >
> > and hack away. [...]

[...]

> > Similarly, "push + pull" should work, but you really will be operating
> > mostly offset from the rest of the world. That might be ok by the time
> > you initialize your repository, and it might be ok for several months
> > after that, but there might come a day when you're starting to regret
> > your use of `git init --object-format=sha256` and have dug yourself into
> > a fairly deep hole.
>
> I do agree that they don't interoperate right now, and that we'd like it
> to in the future. But there are definitely people who can use SHA-256
> support for new projects without problems. I'm aware of certain
> government agencies who very much do not want to use SHA-1 at all (and
> at some point will be legally prohibited from doing so), and they will
> be completely fine with the status quo. Some of those same
> organizations are unhappy about prohibited algorithms even being linked
> into the binaries they use. These folks can use a suitably new version
> of Git everywhere and not care about the lack of backwards
> compatibility.
>
> I am, of course, in favor of abandoning SHA-1 as fast as practically
> possible, but I understand that backwards compatibility is obviously a
> concern.

Yeah, I'd prefer them to know that they are early adopters and that they
should be prepared for a situation where there's some incompatibility
across versions. I don't just mean "I can't read my old SHA-1 data any
more", I mean "I used Git v2.29.0 to create a SHA-256 repo and now Git
v3.15.0 won't read it". (Or v2.31.0?)

I've followed the work on the commit graph functionality and file format
mostly from the sidelines.
It's been lots of good work with lots of good outcomes, but there also
seem to have been (of course) a few incompatibilities, bugs and "argh,
if only we'd done it like this from the beginning" moments. I'd assume
the effort -- and potential for bugs and "ooh, we should have done it
that way" -- for SHA-256--SHA-1 interoperability to be larger than
what's been put into the commit graph so far.

> > Workflows aside, let's consider a more technical aspect. Pack index
> > files (pack-*.idx) exist in two flavours: v1 and v2. [...]

[...]

> > We could certainly (re)define v2 to match our SHA-256 behavior, but we
> > do foresee v3 for a reason. And that would still just fix this specific
> > issue. And even when everything around SHA-256 is well-defined and we
> > have SHA-1--SHA-256 interoperability, there's a risk, at least
> > initially, that somewhere we'd be permeating buggy data that we'd then
> > feel responsible for and need to be able to handle for a long time to
> > come.
>
> These are valid index v1 and v2 files, just with a different hash
> algorithm.

I claim that they are not valid, precisely because they use a different
hash algorithm.

> v3 is there for the point where we do interoperate and need
> to store hash values of multiple algorithms at once. There's little to
> no benefit to v3 if you don't need multiple algorithm support, other
> than the fact that they declare the algorithms in them.

One additional benefit: they'd correspond to a specification. :-)

> This is no different than saying that our commit or tree objects are in
> a different form; they are syntactically identical, just with a
> different hash algorithm. That's how everything is in the .git
> directory.

For objects, I could perhaps accept that the format outlined in the hash
transition document is the specification.

That document says that pack indices "use a new v3 format that supports
multiple hash functions." It then goes on to draft such a format.
(Maybe it's a specification, but until there exists at least one
implementation, I'd rather see it as a draft.) There is no mention of v2
pack indices with SHA-256 data, either in that document or anywhere else
in our documentation that I could find.

The "v2 but with SHA-256" packfile index format we're producing contains
lots of 32 B SHA-256s instead of 20 B SHA-1s. Ok, that much can be
guessed in one try. The index file ends with a 32 B SHA-256, after
referencing the 32 B packfile SHA-256. Ok, maybe that could also be
guessed.

If we're committed to maintaining that format, we should put it down in
writing. And if we're not committed to it, we should make that clear.

The hash transition document foresees a packfile index format v3.
Notably, it uses a _20_ B SHA-256 checksum and references a _20_ B
SHA-256 packfile checksum. In light of that, are we certain that the "v2
with SHA-256" format outlined above is not a maintenance burden? Or
that, if there is any kind of cost, it's worth it? Or, for that matter,
that guessing the details of "v2 but with SHA-256" is trivial?

I fully respect the effort that has gone into making the test suite run
with 32 B SHA-256 instead of 20 B SHA-1. But do we really intend to
support, for many years to come, the new file formats that such a test
run produces and consumes? Bundles v3, yeah, I guess so. Thanks for
making that move! Pack index "v2 but with SHA-256", maybe. At the very
least, we should make that choice consciously.

> > +THIS OPTION IS EXPERIMENTAL! SHA-256 support is experimental and still
> > +in an early stage. A SHA-256 repository will in general not be able to
> > +share work with "regular" SHA-1 repositories. It should be assumed
> > +that, e.g., Git internal file formats in relation to SHA-256
> > +repositories may change in backwards-incompatible ways. Only use
> > +`--object-format=sha256` for testing purposes.
>
> I'm fine with marking the functionality experimental for a few releases,
> since it is possible we have bugs people haven't found, and adding a
> note about interoperability after that point, since I think that's a
> fair and valuable issue. I think if we go a few releases without any
> major issues, we can change this to the following:
>
>   Note that a SHA-256 repository cannot yet share work with "regular"
>   SHA-1 repositories. Many tools do not yet understand SHA-256
>   repositories, so users may wish to take this into account when
>   creating new repositories.

With respect, I think that's too aggressive. By that time, we may
conclude that, e.g., the "v2 pack indices with SHA-256" file handling is
robust. But I'd be surprised if using `git init --object-format=sha256`
in June 2021 doesn't cause *some* extra work for users or ourselves
further down the line compared to using a regular SHA-1 `git init`.

Pushing to a SHA-1 hosting service will become *possible* at some point,
but maybe it won't be *efficient enough to be practical in the real
world* until some time after that. All those other, *new* file formats
outlined in the hash transition document won't exist at that time (at
least not in master).

Now would probably be a good time to update the hash transition
document, first of all to tick off what we've already done, and second,
to reassess the rest. Quoting:

  The first user-visible change is the introduction of the objectFormat
  extension (without compatObjectFormat). This requires:

  - implementing the loose-object-idx
  - teaching fsck about this mode of operation
  - using the hash function API (vtable) when computing object names
  - signing objects and verifying signatures
  - rejecting attempts to fetch from or push to an incompatible
    repository

I don't think we're there yet. Maybe, e.g., the new loose-object-idx
isn't strictly needed, in which case this part of the plan could be
updated.
(Or maybe whoever wrote the above thought there'd be some value in
knowing that *all* SHA-256 repos *always* have loose-object-idx tables,
to save us from some file-discovery dancing?)

We do say elsewhere in the document that

  Alongside the packfile, a SHA-256 repository stores a bidirectional
  mapping between SHA-256 and SHA-1 object names.

So at this time we do not seem to be producing correct, proper,
as-specified (or at least as-drafted) SHA-256 repositories. Or maybe in
2030, we can stop insisting on such a mapping, because everyone uses
SHA-256 anyway, so then maybe it shouldn't be mandatory now, either.

"Signing objects" is a bit vague, but under "Signed Commits", I see:

  [...] This means commits can be signed

  1. using SHA-1 only, as in existing signed commit objects
  2. using both SHA-1 and SHA-256, by using both gpgsig-sha256 and
     gpgsig fields.
  3. using only SHA-256, by only using the gpgsig-sha256 field.

Right now, we can do either 1 or 3. Maybe that's enough. I do think
there's a bug in git-replace where we'll only remove the last signature,
but as we'll currently only create one signature, that's perhaps "ok".

I still believe we should think hard before saying (even if we only say
so by omission) that, e.g., file structures are known-good and will be
supported for a long time to come.

Martin
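
PS. To make the "v2 but with SHA-256" guessing game a bit more concrete,
here is a rough Python sketch of my reading of the layout discussed
above: the usual v2 header and fanout table, then object names, CRCs and
offsets, then the packfile checksum and the index checksum, with every
hash simply being as long as the algorithm's digest. This is an
assumption on my part, not a specification. The helper names
(`build_idx_v2`, `parse_idx_v2`) are made up for illustration, and the
sketch ignores the optional 64-bit offset table, so it only covers small
packs.

```python
import hashlib
import struct

IDX_MAGIC = b"\xfftOc"  # signature of v2 pack index files
HEADER_LEN = 8          # magic + version
FANOUT_LEN = 256 * 4    # 256 big-endian 32-bit counters


def build_idx_v2(entries, pack_checksum, hash_name="sha256"):
    """Build a tiny v2-style index in memory (hypothetical helper).

    entries is a list of (object_name, crc32, offset) tuples. Offsets
    are assumed to fit in 31 bits, so no 64-bit offset table is written.
    """
    entries = sorted(entries)  # object names are stored in sorted order
    fanout = [0] * 256
    for name, _, _ in entries:
        # fanout[b] counts the objects whose first byte is <= b
        for b in range(name[0], 256):
            fanout[b] += 1
    body = IDX_MAGIC + struct.pack(">I", 2)
    body += struct.pack(">256I", *fanout)
    body += b"".join(name for name, _, _ in entries)
    body += b"".join(struct.pack(">I", crc) for _, crc, _ in entries)
    body += b"".join(struct.pack(">I", off) for _, _, off in entries)
    body += pack_checksum
    # The index checksum covers everything that precedes it.
    return body + hashlib.new(hash_name, body).digest()


def parse_idx_v2(data, hash_name="sha256"):
    """Parse the parts of the index whose size depends on the hash."""
    hash_len = hashlib.new(hash_name).digest_size
    version = struct.unpack(">I", data[4:8])[0]
    if data[:4] != IDX_MAGIC or version != 2:
        raise ValueError("not a v2 pack index")
    fanout = struct.unpack(">256I", data[HEADER_LEN:HEADER_LEN + FANOUT_LEN])
    count = fanout[255]
    start = HEADER_LEN + FANOUT_LEN
    names = [data[start + i * hash_len:start + (i + 1) * hash_len]
             for i in range(count)]
    expected = hashlib.new(hash_name, data[:-hash_len]).digest()
    return {
        "count": count,
        "names": names,
        "pack_checksum": data[-2 * hash_len:-hash_len],
        "checksum_ok": expected == data[-hash_len:],
    }


# Made-up sample data: one object in one pack.
name = hashlib.sha256(b"example object").digest()
pack_sum = hashlib.sha256(b"example pack").digest()
idx = build_idx_v2([(name, 0, 12)], pack_sum)
info = parse_idx_v2(idx)
print(len(idx), info["count"], info["checksum_ok"])
```

Note that nothing in these bytes says which hash function was used. The
reader has to know it from elsewhere, and parsing the same bytes with
`hash_name="sha1"` passes the header check but then fails the checksum.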