Re: Is the sha256 object format experimental or not?

dwh@xxxxxxxxxxxxxxxxxxx · Fri, 14 May 2021 11:10:09 -0700

On 14.05.2021 10:49, Ævar Arnfjörð Bjarmason wrote:
I agree insofar that I don't see a good reason for us to support some
plethora of hash algorithms, but I wouldn't have objections to adding
more if people find them useful for some reason. See e.g. [1] for an
implementation.

I think Git should not try to do any cryptographic operations at all and
rely on external tools that are implemented properly and hardended.
Implementing cryptography isn't just about translating the algorithm
into code but also getting memory security correct, file handling
correct, input security correct, control flow correct (equal cost
multi-path), etc, etc. Most of the cryptography libraries aren't
designed to be misuse resistant. The only one I know of that has that as
a top-line requirement is Hyperledger Ursa [1].

I would like to see us remove all cryptography code (e.g. digests,
digital signatures, etc) from Git and rely on external tools entirely.
If we store the cryptographic material in a self-describing format that
identifies the associated tool as well as the cryptographic data, then
Git can be completely agnostic.

But I really don't see how anything you've said would present a
technical hurdle once we have SHA-1<->SHA-256 interop in a good enough
state. At that point we'll support re-hashing on arrival of content
hashed with algorithm X into Y, with a local lookup table between X<=>Y.

So if somebody wants to maintain content hashed with algorithm Z locally
we should easily be able to support that. The "diversity of naming"
won't matter past that local repository, any mention of Z will be
translated to X or Y on fetch/push.

Using self-describing formats allows us to honor history and keep old
object names as they and eliminate all of this added complications you
describe. I think there is a lot of room for errors to creep in when
collaborators have copies of the same repo and they have local mappings
between different hashing algorithms. How is this not setting up for a
combinatorial explosion of data? If the canonical repo uses SHA1 and one
contributor uses SHA2-512, another uses Blake2b-256, and yet another
uses SHA3-384, won't they all have to maintain six different translation
tables for all objects? SHA1 <=> SHA2-512, SHA1 <=> Blake2b-256, SHA1
<=> SHA3-384, SHA2-512 <=> Blake2b-256, SHA2-512 <=> SHA3-384, and
Blake2b-256 <=> SHA3-384? I guess that's your motivation for not
allowing algorithmic agility.

The way around this is to use self-describing formats and external
tools. Git repo copies wouldn't be required to have only *one* algorithm
naming all objects, requiring the translation tables. Instead Git repos
would/could have heterogeneous object names, each one with a single name
generated with a different digest algorithm. Git would simply consider
those names as plain strings and validating those strings requires
talking to the correct external tool, sending the name string and the
object data and reading back the result.

I think this is a much better approach because:

1. It creates algorithmic agility in a way that isn't top-down and heavy
handed.

2. It eliminates the need for all of the translation tables and
round-tripping complexity.

3. It empowers maintainers to decide which algorithms can/must be used
when naming objcts in a given repo. Merge hooks, CI/CD checks and
etiquette guides can be used to enforce this.

4. Git's attack surface becomes smaller (a very good thing) and limited
to doing IPC to external tools correctly and securely (easy) instead of
trying to get cryptography client code correct (very difficult).

One other thing to consider is that there are new tools being developed
that do similar things as Git that do have algorithmic agility and use
self-describing cryptographic primitives. Late-binding trust is now a
best practice and has been for quite some time. Many people rely upon
Git and I think we should keep up with the best practices.

Cheers!
Dave