Re: SHA1 collisions found

Junio C Hamano <gitster@xxxxxxxxx> · Fri, 24 Feb 2017 09:32:13 -0800

Ian Jackson <ijackson@xxxxxxxxxxxxxxxxxxxxxx> writes:

> I have been thinking about how to do a transition from SHA1 to another
> hash function.

Good.  I think many of us have been, too, not necessarily just
in the past few days in response to shattered, but over the last 10
years, yet without coming to a consensus design ;-)

> I have concluded that:
>
>  * We should avoid expecting everyone to rewrite all their
>    history.

Yes.

>  * Unfortunately, because the data formats (particularly, the commit
>    header) are not in practice extensible (because of the way existing
>    code parses them), it is not useful to try to generate new data (new
>    commits etc.) containing both new hashes and old hashes: old
>    clients will mishandle the new data.

Yes.

>  * Therefore the transition needs to be done by giving every object
>    two names (old and new hash function).  Objects may refer to each
>    other by either name, but must pick one.  The usual shape of

I do not think it is necessarily so.  Existing code may not be able
to read anything new, but you can make the new code understand
object names in both formats, and for a smooth transition, I think
the new code needs to.

For example, a new commit that records a merge of an old and a new
commit whose resulting tree happens to be the same as the tree of
the old commit may begin like so:

    tree 21b97d4c4f968d1335f16292f954dfdbb91353f0
    parent 20769079d22a9f8010232bdf6131918c33a1bf6910232bdf6131918c33a1bf69
    parent 22af6fef9b6538c9e87e147a920be9509acf1ddd

naming the one object that was named with the new hash by its new,
longer name, while recording the names of the other, existing
objects with SHA-1.  We would need to extend the object format for
tag (which would be trivial as the object reference is textual and
similar to a commit) and tree (much harder), of course.
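
To make the toy example concrete, here is a minimal sketch (in
Python, not actual Git code) of how a new reader might tell the two
eras apart while parsing such a header.  The function names and the
"newhash" label are made up for illustration; the only rule assumed
is the one in the example above, i.e. 40 hex digits means SHA-1 and
64 hex digits means the new hash.

```python
import re

def classify_object_name(name):
    """Guess the hash era of a hex object name from its length alone."""
    if not re.fullmatch(r"[0-9a-f]+", name):
        raise ValueError("not a lowercase hex object name: %r" % name)
    if len(name) == 40:
        return "sha1"
    if len(name) == 64:
        return "newhash"  # placeholder label; no successor hash is chosen here
    raise ValueError("unexpected object name length: %d" % len(name))

def parse_commit_header(text):
    """Return (field, era, name) for each tree/parent line in the header."""
    refs = []
    for line in text.splitlines():
        if not line:
            break  # a blank line ends the header
        key, _, value = line.partition(" ")
        if key in ("tree", "parent"):
            refs.append((key, classify_object_name(value), value))
    return refs

example = (
    "tree 21b97d4c4f968d1335f16292f954dfdbb91353f0\n"
    "parent 20769079d22a9f8010232bdf6131918c33a1bf6910232bdf6131918c33a1bf69\n"
    "parent 22af6fef9b6538c9e87e147a920be9509acf1ddd\n"
    "\n"
    "Merge an old and a new commit\n"
)
```

Feeding the example header to parse_commit_header() classifies the
tree and the second parent as SHA-1 and the first parent as the new
hash, without the objects themselves needing a second name.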

As long as the reader can tell, from the format of the object names
stored in a "new object format" object, which era each name refers
to [*1*], we can name new objects with only the new hash, I would
think.  A "new refers only to new" rule that stratifies objects into
older and newer may make things simpler, but I am not convinced yet
that it would give our users a smooth enough transition path (but I
am open to being educated and persuaded the other way).

[Footnote]

*1* In the above toy example, the length being 40 vs 64 is used to
    distinguish SHA-1 from the new hash, and careful readers may
    wonder if we should use sha-3,20769079d22... or something like
    that, which more explicitly identifies what hash is used, so that
    we can pick a hash whose length is 64 when we transition again.

    I personally do not think such a prefix is necessary during the
    first transition; we will likely adopt a new hash again, and
    at that point that third one can have a prefix to differentiate
    it from the second one.
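
    As a rough sketch of how the two schemes could coexist (again
    Python, again purely illustrative), a reader could accept an
    optional "algo," prefix and fall back to length for unprefixed
    names.  The "sha-3," spelling is just the one from the example
    above; the function name, the table and the "newhash" label are
    assumptions, not a proposed format.

```python
# Lengths that are unambiguous without a prefix in this toy scheme.
KNOWN_LENGTHS = {40: "sha1", 64: "newhash"}

def parse_object_name(token):
    """Split 'algo,hex' if a prefix is present, else infer era from length."""
    algo, sep, hexname = token.partition(",")
    if sep:
        # An explicit prefix wins, so a later 64-digit hash stays distinguishable.
        return algo, hexname
    era = KNOWN_LENGTHS.get(len(token))
    if era is None:
        raise ValueError("cannot infer hash from length %d" % len(token))
    return era, token
```

    With this, parse_object_name("sha-3," + name) reports "sha-3"
    whatever the length of the name is, while the bare 40- and
    64-digit names keep working as before.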