Re: SHA1 collisions found

Linus Torvalds <torvalds@xxxxxxxxxxxxxxxxxxxx> · Fri, 24 Feb 2017 16:39:45 -0800

On Fri, Feb 24, 2017 at 3:39 PM, Jeff King <peff@xxxxxxxx> wrote:
>
> One thing I worry about in a mixed-hash setting is how often the two
> will be mixed.

Honestly, I think that a primary goal for a new hash implementation
absolutely needs to be to minimize mixing.

Not for security issues, but because of combinatorics. You want to
have a model that basically reads old data, but that very aggressively
approaches "new data only" in order to avoid the situation where you
have basically the exact same tree state, just _represented_
differently.

For example, what I would suggest the rules be is something like this:

 - introduce new tag2/commit2/tree2/blob2 object type tags that imply
that they were hashed using the new hash

 - an old type obviously can never contain a pointer to a new type (ie
you can't have a "tree" object that contains a tree2 object or a blob2
object.

 - but also make the rule that a *new* type can never contain a
pointer to an old type, with the *very* specific exception that a
commit2 can have a parent that is of type "commit".

That way everything "converges" towards the new format: the only way
you can stay on the old format is if you only have old-format objects,
and once you have a new-format object all your objects are going to be
new format - except for the history.

Obviously, if somebody stays in old format, you might end up still
getting some object duplication when you continue to merge from him,
but that tree can never merge back without converting to new-format,
so it will be a temporary situation.

So you will end up with duplicate objects, and that's not good (think
of what it does to all our full-tree "diff" optimizations, for example
- you no longer get the "these sub-trees are identical" across a
format change), but realistically you'll have a very limited time of
that kind of duplication.

I'd furthermore suggest that from a UI standpoint, we'd

 - convert to 64-character hex numbers (32-byte hashes)

 - (as mentioned earlier) default to a 40-character abbreviation

 - make the old 40-character SHA1's just show up within the same
address space (so they'd also be encoded as 32-byte hashes, just with
the last 12 bytes zero).

 - you'd see in the "object->type" whether it's a new or old-style hash.

I suspect it shouldn't be too painful to do it that way.

                Linus