Re: SHA1 collisions found

On Fri, Feb 24, 2017 at 4:39 PM, Linus Torvalds
<torvalds@xxxxxxxxxxxxxxxxxxxx> wrote:
>
> Honestly, I think that a primary goal for a new hash implementation
> absolutely needs to be to minimize mixing.
>
> Not for security issues, but because of combinatorics. You want to
> have a model that basically reads old data, but that very aggressively
> approaches "new data only" in order to avoid the situation where you
> have basically the exact same tree state, just _represented_
> differently.
>
> For example, what I would suggest the rules be is something like this:

Hmm. Having looked at this a fair amount, and in particular having
looked at the code as part of the hash typesafety patch I did, I am
actually starting to think that a totally different approach would not
be too painful at all, and might be a lot easier to do.

So bear with me, let me try to explain my thinking:

 (a) if we want to be backwards compatible and not force people to
convert their trees on some flag day, we're going to be stuck with
keeping the SHA1 code and all the existing object parsing around
basically forever.

Now, this basically means that the _easy_ solution would be that we
just do the flag day, switch to sha-256, extend everything to 32-byte
hashes, and just have a "git2 fast-import" that makes it easy to
convert stuff.

But it really would be a completely different version of git, with a
new pack-file format and no real compatibility. Such a flag-day
approach would certainly have advantages: it would let us re-architect
some bad choices:

 - make the hashing something that can be threaded (i.e. maybe we can
just split the data into 4MB chunks that can be hashed in parallel, and
make the git object hash be the hash of those chunk hashes - see the
sketch after this list)

 - replace zlib with something like zstd

 - get rid of old pack formats etc.
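
To make that "hash of hashes" idea a bit more concrete, here's a
minimal sketch of what the chunked hashing could look like. It uses
OpenSSL's one-shot SHA256() helper; the 4MB chunk size, the function
name and the exact construction are assumptions for illustration only,
not a proposed format, and the actual threading is left out:

  #include <openssl/sha.h>
  #include <stdlib.h>

  #define CHUNK_SIZE (4 * 1024 * 1024)    /* assumed 4MB chunk size */

  static void hash_of_hashes(const unsigned char *data, size_t len,
                             unsigned char out[SHA256_DIGEST_LENGTH])
  {
      size_t i, nchunks = (len + CHUNK_SIZE - 1) / CHUNK_SIZE;
      unsigned char *digests;

      if (!nchunks)
          nchunks = 1;            /* empty input: hash one empty chunk */
      digests = malloc(nchunks * SHA256_DIGEST_LENGTH);
      if (!digests)
          return;                 /* error handling elided in this sketch */

      /* each iteration is independent, so a real version could hand
       * these chunks to worker threads */
      for (i = 0; i < nchunks; i++) {
          size_t off = i * CHUNK_SIZE;
          size_t n = (len - off) < CHUNK_SIZE ? (len - off) : CHUNK_SIZE;

          SHA256(data + off, n, digests + i * SHA256_DIGEST_LENGTH);
      }

      /* the object hash is then just the hash of the chunk hashes */
      SHA256(digests, nchunks * SHA256_DIGEST_LENGTH, out);
      free(digests);
  }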

But on the whole, I still think that the compatibility would be worth
much more than the possible technical advantages of a clean-slate
restart.

 (b) the SHA1 hash is actually still quite strong, and the collision
detection code effectively means that we don't really have to worry
about collisions in the immediate future.

In other words, the mitigation of the current attack is actually
really easy technically (modulo perhaps the performance concerns), and
there's still nothing fundamentally wrong with using SHA1 as a content
hash. It's still a great hash.

Now, my initial reaction (since it's been discussed for so long
anyway) was obviously "pick a different hash". That was everybody's
initial reaction, I think.

But I'm starting to think that maybe that initial obvious reaction was wrong.

The thing that makes collision attacks so nasty is that our reaction
to a collision is so deadly.  But that's not necessarily fundamental:
we certainly use hashes with collisions every day, and they work
fine. And they work fine because the code that uses those hashes is
designed to simply deal gracefully - although very possibly with a
performance degradation - with two different things hashing to the
same bucket.

So what if the solution to "SHA1 has potential collisions" is "any
hash can have collisions in theory, let's just make sure we handle
them gracefully"?

Because I'm looking at our data structures that have hashes in them,
and many of them are actually of the type where I go

  "Hmm..  Replacing the hash entirely is really really painful - but
it wouldn't necessarily be all that nasty to extend the format to have
additional version information".

and the advantage of that approach is that it actually makes the
compatibility part trivial. No more "have to pick a different hash and
two different formats", and more of a "just the same format with
extended information that might not be there for old objects".

So we have a few different types of formats:

 - the purely local data structures: the pack index file, the file
index, our refs etc

   These we could change completely, and it wouldn't even be all
that painful. The pack index has already gone through versions, and it
doesn't affect anything else.

 - the actual objects.

   These are fairly painful to change, particularly things like the
"tree" object which is probably the worst designed of the lot. Making
it contain a fixed-size binary thing was really a nasty mistake. My
bad.

 - the pack format and the protocol to exchange "I have this" information

   This is *really* painful to change, because it not only contains
the raw object data, but obviously also ends up being the wire format
for remote access.

and it turns out that *all* of these formats look like they would be
fairly easy to extend to having extra object version information. Some
of that extra object version information we already have and don't
use, in fact.

Even the tree format, with the annoying fixed-size binary blob. Yes,
it has that fixed-size binary blob, but it has it in _addition_ to the
ASCII textual form that would be really easy to just extend. We
already have that "tree entry type", which we've extended once before
by using it for submodules. It would be quite easy to just say that a
tree entry also has a "file version" field, so that you can have
multiple objects that just hash to the same SHA1, and git wouldn't
even *care*.
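
Purely as an illustration of how backwards compatible that could be -
the ".<version>" suffix below is invented syntax, not a real or
proposed tree format - a parser could treat the version as an optional
field that old trees simply never have:

  #include <stdlib.h>
  #include <string.h>

  struct tree_entry {
      unsigned int mode;
      unsigned int version;       /* 0 when the entry has no version field */
      const char *name;
      const unsigned char *sha1;  /* points at the 20 raw hash bytes */
  };

  /*
   * Parse "<mode>[.<version>] <name>\0<20-byte hash>" out of a raw tree
   * buffer.  Old trees simply never contain the ".<version>" part, so
   * they keep parsing exactly as before - that's the whole point.
   * Returns a pointer just past the entry, or NULL on a malformed entry.
   */
  static const char *parse_entry(const char *p, const char *end,
                                 struct tree_entry *e)
  {
      char *next;

      e->mode = strtoul(p, &next, 8);           /* octal mode, as today */
      e->version = (*next == '.') ? strtoul(next + 1, &next, 10) : 0;
      if (*next != ' ')
          return NULL;

      e->name = next + 1;
      next += 1 + strlen(next + 1) + 1;         /* skip " <name>\0" */
      if (next + 20 > end)
          return NULL;
      e->sha1 = (const unsigned char *)next;
      return next + 20;
  }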

The transfer protocol is the same: yes, we transfer hashes around, but
it would not be all that hard to extend it to "transfer hash and
object version".
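
Something like this hypothetical helper, say - the "obj-version="
token is made up for illustration; only the pkt-line framing (the
4-hex-digit length prefix that counts itself) is how the protocol
already frames its lines:

  #include <stdio.h>

  static int write_want_pkt(char *out, size_t outsz,
                            const char *hex_hash, unsigned int version)
  {
      char payload[128];
      int n = snprintf(payload, sizeof(payload),
                       "want %s obj-version=%u\n", hex_hash, version);

      if (n < 0 || (size_t)n >= sizeof(payload))
          return -1;
      /* pkt-line length prefix includes the 4 length bytes themselves */
      return snprintf(out, outsz, "%04x%s", (unsigned int)(n + 4), payload);
  }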

And the difference is that then the "backwards compatibility" part
just means interacting with somebody who didn't know to transfer the
object version. So suddenly being backwards compatible isn't a whole
different object parsing thing, it's just a small extension.

IOW, we could make it so that the SHA1 is just a hash into a list of
objects. Even the pack index format wouldn't need to change - right
now we assume that an index hit gives us the direct pointer into the
pack file, but we *could* just make it mean that it gives us a direct
pointer to the first object in the pack file with that SHA1 hash.
Exactly like you'd normally use a hash table with linear probing.

Linear probing is usually considered a horrible approach to hash
tables, but it's actually a really useful one for the case where
collisions are very rare.
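
A rough sketch of that lookup - with a hypothetical in-memory entry
layout, not git's actual pack index code - where the hash hit only
gives you the first candidate and you probe forward, using the extra
version field to pick the right object:

  #include <stddef.h>
  #include <string.h>

  /* hypothetical in-memory index entry, not the real pack .idx layout */
  struct obj_entry {
      unsigned char sha1[20];
      unsigned int version;       /* the proposed extra "object version" */
      unsigned long pack_offset;
  };

  /*
   * entries[] is sorted by sha1, like a pack index.  A hash hit only
   * gives us the first entry with that sha1; we then probe linearly and
   * use the version field to pick the right object.  Collisions are so
   * rare that the probe loop almost always runs exactly once.
   */
  static struct obj_entry *find_object(struct obj_entry *entries, size_t nr,
                                       const unsigned char *sha1,
                                       unsigned int version)
  {
      size_t lo = 0, hi = nr;

      while (lo < hi) {                   /* find first entry >= sha1 */
          size_t mid = lo + (hi - lo) / 2;

          if (memcmp(entries[mid].sha1, sha1, 20) < 0)
              lo = mid + 1;
          else
              hi = mid;
      }

      for (; lo < nr && !memcmp(entries[lo].sha1, sha1, 20); lo++)
          if (entries[lo].version == version)
              return &entries[lo];
      return NULL;
  }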

Anyway, I do have a suggestion for what the "object version" would be,
but I'm not even going to mention it, because I want people to first
think about the _concept_ and not the implementation.

So: What do you think about the concept?

               Linus


