Re: Is the sha256 object format experimental or not?

dwh@xxxxxxxxxxxxxxxxxxx · Thu, 13 May 2021 16:26:14 -0700

On 14.05.2021 06:03, Junio C Hamano wrote:
dwh@xxxxxxxxxxxxxxxxxxx writes:

I think Git should externalize the calculation of object digests just
like it externalizes the calculation of object digital signatures.

The hashing algorithms used to generate object names have
requirements fundamentally different from those of digital
signatures.  I strongly suspect that that fact would change the
equation when you rethink what you said above.

I agree with you. Object names are exactly that: names. Names for
resources/data must be persistent, as well as global in scope and
uniqueness, and autonomously assigned. What this means is that once an
object has a name, that name shall never change as long as the object
remains unchanged. The names must be unique in the scope of all objects
(e.g. all copies of a repo) and generated without coordination.

Calculating object names using a digest algorithm meets all of these
requirements. Choosing a strong digest algorithm creates a strong
cryptographic binding between the name and the object contents. Using
self-describing digests allows for a repo to switch digest algorithms at
arbitrary points in the history.
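
As a rough illustration of what a self-describing digest could look
like, here is a minimal Python sketch. The one-byte algorithm tags and
the function names are hypothetical, invented for this example; they are
not an existing Git format or any published encoding. The name is simply
a tag byte followed by the raw digest, base64url-encoded without padding.

```python
import base64
import hashlib
from typing import Tuple

# Hypothetical one-byte algorithm tags (an assumption for this sketch,
# not part of Git or of any standard encoding).
ALGORITHMS = {
    0x01: "sha1",
    0x02: "sha256",
    0x03: "sha3_256",
}
TAG_FOR = {name: tag for tag, name in ALGORITHMS.items()}


def encode_name(algorithm: str, data: bytes) -> str:
    """Return tag byte + digest of data, base64url-encoded without padding."""
    digest = hashlib.new(algorithm, data).digest()
    raw = bytes([TAG_FOR[algorithm]]) + digest
    return base64.urlsafe_b64encode(raw).rstrip(b"=").decode("ascii")


def decode_name(name: str) -> Tuple[str, bytes]:
    """Split a self-describing name into (algorithm, raw digest)."""
    padded = name + "=" * (-len(name) % 4)  # restore stripped padding
    raw = base64.urlsafe_b64decode(padded)
    return ALGORITHMS[raw[0]], raw[1:]


def verify(name: str, data: bytes) -> bool:
    """Recompute the digest with the algorithm named inside the name itself."""
    algorithm, digest = decode_name(name)
    return hashlib.new(algorithm, data).digest() == digest
```

With a 32-byte digest this scheme yields a 44-character name, and a
reader can tell which algorithm to use for verification from the name
alone, without any repo-wide setting.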

I think that objects named with SHA1 digests should remain named with
the SHA1 digest. I do *not* advocate going back and rewriting history
to change all of the object names to a digest with a different
algorithm. Git is a provenance log and history matters. I recommend
preserving all existing names, even if they were created with known-weak
digest algorithms, and making the change to a new algorithm at a
specific point in time (e.g. at a tag). Using self-describing digest
encoding and externalizing digest calculation future-proofs
repositories and allows for preservation of history while allowing
algorithm agility.

To illustrate my point, I envision that a repo could have a history
like this:

object 2923f6fa36614586ea09b4424b438915cc1b9b67 (naked SHA1)
 |
<many objects named with SHA1>
 |
object 5f167fb6b3e96273b564fff0b041fb94fee4d3de (naked SHA1)
 |
<modify Git to ext. digest calculation and self-desc encoding>
 |
object 98c2e1c0965e60b0f137577ac5dd0a5c96ce224d (naked SHA1)
 |
<many objects named with SHA1>
 |
<a project decides to switch to SHA2-256, maybe marked in a tag>
 |
object IAOdLVxteOxQwKa-xn8yCBUkuPkjAqcuQ2V7fKAlao8o (self-desc.SHA2-256)
 |
<many objects named with self-describing SHA2-256 digests>
 |
<a project decides to switch to SHA3-256, maybe marked in a tag>
 |
object EK832G0PFhBFf-Dfgr205UKpUMqmVXJX9ltLwQo4Awct (self-desc.SHA3-256)
 |
<many objects named with self-describing SHA3-256 digests>
 .
 .
 .

Neither the switch to SHA2-256 nor the later switch to SHA3-256 would
require any code changes to Git. If we continue down the current SHA-256
road, we will have to repeat that multi-year effort in the future to
switch to SHA3 or something else. Most importantly, the choice of digest
algorithm would be left up to the maintainers of a given repo and not
limited to the algorithms we have hard-coded into Git.
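
To make the mixed-history idea a bit more concrete, the following Python
sketch checks an object against either kind of name. It assumes a
hypothetical helper registry (HELPERS) standing in for externally
supplied digest programs, and a made-up tag-byte encoding for the
self-describing names. Only the naked SHA1 branch reflects how Git names
SHA1 objects today: a digest over the header "<type> <size>", a NUL byte,
and the content.

```python
import base64
import hashlib
import re
from typing import Callable, Dict

# Hypothetical registry standing in for external digest helpers, keyed by
# an invented one-byte tag. Adding an algorithm means adding an entry
# here, not changing the verification logic below.
DigestFn = Callable[[bytes], bytes]
HELPERS: Dict[int, DigestFn] = {
    0x02: lambda data: hashlib.sha256(data).digest(),
    0x03: lambda data: hashlib.sha3_256(data).digest(),
}

# A naked SHA1 name is exactly 40 lowercase hex characters.
NAKED_SHA1 = re.compile(r"[0-9a-f]{40}")


def git_payload(obj_type: str, content: bytes) -> bytes:
    """Build the bytes Git hashes: '<type> <size>', a NUL byte, the content."""
    return f"{obj_type} {len(content)}".encode("ascii") + b"\0" + content


def make_name(tag: int, obj_type: str, content: bytes) -> str:
    """Create a self-describing name (tag byte + digest, base64url, unpadded)."""
    raw = bytes([tag]) + HELPERS[tag](git_payload(obj_type, content))
    return base64.urlsafe_b64encode(raw).rstrip(b"=").decode("ascii")


def name_matches(name: str, obj_type: str, content: bytes) -> bool:
    """Check an object against a naked SHA1 name or a self-describing name."""
    payload = git_payload(obj_type, content)
    if NAKED_SHA1.fullmatch(name):
        # Old history: existing SHA1 names are preserved unchanged.
        return hashlib.sha1(payload).hexdigest() == name
    raw = base64.urlsafe_b64decode(name + "=" * (-len(name) % 4))
    helper = HELPERS.get(raw[0])
    if helper is None:
        raise ValueError(f"no digest helper registered for tag {raw[0]:#04x}")
    return helper(payload) == raw[1:]
```

In this sketch the well-known empty-blob name
e69de29bb2d1d6434b8b29ae775ad8c2e48c5391 still verifies through the
SHA1 branch, while newer objects are dispatched to whichever helper
their name identifies.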

Brian's work on the SHA-256 switch is valuable. We can leverage a lot of
it to switch to externalized digest calculation and self-describing
digests and never have to worry about doing that again.

Cheers!
Dave