On 14.05.2021 10:49, Ævar Arnfjörð Bjarmason wrote:
I agree insofar that I don't see a good reason for us to support some plethora of hash algorithms, but I wouldn't have objections to adding more if people find them useful for some reason. See e.g. [1] for an implementation.
I think Git should not try to do any cryptographic operations at all and rely on external tools that are implemented properly and hardened. Implementing cryptography isn't just about translating the algorithm into code but also getting memory security correct, file handling correct, input security correct, control flow correct (equal cost multi-path), etc, etc. Most of the cryptography libraries aren't designed to be misuse resistant. The only one I know of that has that as a top-line requirement is Hyperledger Ursa [1]. I would like to see us remove all cryptography code (e.g. digests, digital signatures, etc) from Git and rely on external tools entirely. If we store the cryptographic material in a self-describing format that identifies the associated tool as well as the cryptographic data, then Git can be completely agnostic.
But I really don't see how anything you've said would present a technical hurdle once we have SHA-1<->SHA-256 interop in a good enough state. At that point we'll support re-hashing on arrival of content hashed with algorithm X into Y, with a local lookup table between X<=>Y. So if somebody wants to maintain content hashed with algorithm Z locally we should easily be able to support that. The "diversity of naming" won't matter past that local repository, any mention of Z will be translated to X or Y on fetch/push.
Using self-describing formats allows us to honor history, keep old object names as they are, and eliminate all of the added complications you describe. I think there is a lot of room for errors to creep in when collaborators have copies of the same repo and they have local mappings between different hashing algorithms.

How is this not setting up for a combinatorial explosion of data? If the canonical repo uses SHA1 and one contributor uses SHA2-512, another uses Blake2b-256, and yet another uses SHA3-384, won't they all have to maintain six different translation tables for all objects? SHA1 <=> SHA2-512, SHA1 <=> Blake2b-256, SHA1 <=> SHA3-384, SHA2-512 <=> Blake2b-256, SHA2-512 <=> SHA3-384, and Blake2b-256 <=> SHA3-384? I guess that's your motivation for not allowing algorithmic agility.

The way around this is to use self-describing formats and external tools. Git repo copies wouldn't be required to have only *one* algorithm naming all objects, which is what requires the translation tables. Instead Git repos would/could have heterogeneous object names, each object having a single name generated with whichever digest algorithm was chosen for it. Git would simply treat those names as plain strings. Validating a name means talking to the correct external tool: sending it the name string and the object data and reading back the result.

I think this is a much better approach because:

1. It creates algorithmic agility in a way that isn't top-down and heavy-handed.
2. It eliminates the need for all of the translation tables and round-tripping complexity.
3. It empowers maintainers to decide which algorithms can/must be used when naming objects in a given repo. Merge hooks, CI/CD checks and etiquette guides can be used to enforce this.
4. Git's attack surface becomes smaller (a very good thing) and limited to doing IPC to external tools correctly and securely (easy) instead of trying to get cryptography client code correct (very difficult).
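To make the idea concrete, here is a rough Python sketch of what I mean by self-describing names. Everything in it is an assumption made for illustration: the `algo:hexdigest` name layout, the `TOOLS` registry, and the function names are hypothetical, and `hashlib` merely stands in for the external tools that would really be reached over IPC. It also hashes the raw bytes rather than Git's actual object encoding.

```python
import hashlib
from itertools import combinations

# Hypothetical registry: algorithm tag -> digest function. In the proposal
# each entry would be a separate external tool; hashlib is only a stand-in.
TOOLS = {
    "sha1": lambda data: hashlib.sha1(data).hexdigest(),
    "sha2-512": lambda data: hashlib.sha512(data).hexdigest(),
    "blake2b-256": lambda data: hashlib.blake2b(data, digest_size=32).hexdigest(),
    "sha3-384": lambda data: hashlib.sha3_384(data).hexdigest(),
}


def make_name(algo, data):
    """Build a self-describing name of the (assumed) form 'algo:hexdigest'."""
    return f"{algo}:{TOOLS[algo](data)}"


def verify_name(name, data):
    """Treat the name as an opaque string and ask the tool it names to check it."""
    algo, sep, digest = name.partition(":")
    if not sep or algo not in TOOLS:
        return False  # unknown or malformed tag: no tool can vouch for this name
    return TOOLS[algo](data) == digest


def translation_tables(algos):
    """Pairwise tables needed if every algorithm must map to every other one."""
    return list(combinations(sorted(algos), 2))


if __name__ == "__main__":
    name = make_name("blake2b-256", b"hello")
    print(name, verify_name(name, b"hello"))
    print(len(translation_tables(TOOLS)), "pairwise tables for", len(TOOLS), "algorithms")
```

With names like these, objects named under different algorithms can sit side by side in one repo, and each is checked by the tool its prefix identifies. The last function just counts the pairwise tables from the four-algorithm example above, which grows as n(n-1)/2.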
One other thing to consider is that there are new tools being developed that do similar things to Git, have algorithmic agility, and use self-describing cryptographic primitives. Late-binding trust is now a best practice and has been for quite some time. Many people rely upon Git and I think we should keep up with the best practices.

Cheers!
Dave