Re: RFC: Separate commit identification from Merkle hashing

Ævar Arnfjörð Bjarmason <avarab@xxxxxxxxx> · Thu, 23 May 2019 23:50:24 +0200

On Thu, May 23 2019, Jonathan Nieder wrote:

> Eric S. Raymond wrote:
>> Jakub Narebski <jnareb@xxxxxxxxx>:
>
>>> Currently Git makes use of the fact that SHA-1 and SHA-256 identifiers
>>> are of different lengths to distinguish them (see section "Meaning of
>>> signatures") in Documentation/technical/hash-function-transition.txt
>>
>> That's the obvious hack.  As a future-proofing issue, though, I think
>> it would be unwise to count on all future hashes being of distinguishable
>> lengths.
>
> We're not counting on that.  As discussed in that section, future
> hashes can change the format.

I think both of you are also missing something that's implicit (but
unfortunately not very explicitly talked about) in that document, which
is that such hash transitions are assumed to have an out-of-bounds
temporal component to them.

I.e. let's assume that instead of SHA-256 we're switching to SHA-X,
which like SHA-1 is also a 20 byte hash function, so they're the same
length.

You'd then get a git.git with SHA-1 today, next year you'd have A
SHA-1<->SHA-X mapping table, but the year after that we'd be fully on
SHA-X for new content.

So even though we carry code and lookup table for looking up the old
SHA-1 values we're not going to continue to pointlessly generate that
bidirectional mapping forever. We'll have some sort of gravestone marker
where we say "past this point it's SHA-X only".

That's not implemented or specified yet, but could e.g. be a magic ref
of some sort advertised by the server, and the client would enforce that
such a marker could only be made with the stronger hash function.

Thus a couple of years after that the SHA-1 -> SHA-X transition someone
generating a colliding tag where a new good SHA-X tag *could* point to
bad SHA-1 content won't be exploitable in practice. At that point
clients won't be downloading SHA-1'd content or generating the mapping
table anymore.

So I don't see why a format change for the tags is needed, it would only
matter *if* we have a full collision *and* the hashes are the same
length (which we have no plan for), *and* if we assume we don't have
some other mitigations in play.