Re: SHA1 collisions found

Linus Torvalds <torvalds@xxxxxxxxxxxxxxxxxxxx> · Thu, 2 Mar 2017 13:21:30 -0800

On Thu, Mar 2, 2017 at 12:43 PM, Junio C Hamano <gitster@xxxxxxxxx> wrote:
>
> My reaction heavily depends on how that "object version" thing
> works.
>
> Would "object version" be like a truncated SHA-1 over the same data
> but with different IV or something, i.e. something that guarantees
> anybody would get the same result given the data to be hashed?

Yes, it does need to be that in practice. So what I was thinking the
object version would be is:

 (a) we actually take the object type into account explicitly.

 (b) we explicitly add another truncated hash.

The first part we can already do without any actual data structure
changes, since basically all users already know the type of an object
when they look it up.

So we already have information that we could use to narrow down the
hash collision case if we saw one.

There are some (very few) cases where we don't already explicitly have
the object type (a tag reference can be any object, for example, and
existing scripts might ask "give me the type of this SHA1 object" with
"git cat-file -t"), but that just goes back to the whole "yeah, we'll
handle legacy uses" point: we will look up objects even _without_ the
extra version data, so it actually integrates well into the whole
notion.

Basically, once you accept that "hey, we'll just have a list of
objects with that hash", it just makes sense to narrow it down by the
object type we also already have.
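
A minimal sketch of that "list of objects per hash" lookup, narrowed by
type when the caller knows it. The ObjectStore class and its method names
are hypothetical, purely for illustration, and are not git's actual data
structures:

```python
from collections import defaultdict


class ObjectStore:
    """Toy store where one SHA1 may map to several objects (hypothetical)."""

    def __init__(self):
        # hash -> list of (object type, payload) entries sharing that hash
        self._by_hash = defaultdict(list)

    def add(self, sha1_hex, obj_type, payload):
        self._by_hash[sha1_hex].append((obj_type, payload))

    def lookup(self, sha1_hex, obj_type=None):
        """Return every object with this hash, narrowed by type if given."""
        candidates = self._by_hash.get(sha1_hex, [])
        if obj_type is None:
            # Legacy caller without type information: return the whole list.
            return list(candidates)
        return [entry for entry in candidates if entry[0] == obj_type]


store = ObjectStore()
# Pretend these two objects collide on the same (made-up) hash value.
store.add("deadbeef", "blob", b"file contents")
store.add("deadbeef", "commit", b"commit contents")

print(len(store.lookup("deadbeef")))     # both colliding objects
print(store.lookup("deadbeef", "blob"))  # only the blob
```

A caller that already knows it wants a blob never sees the colliding
commit, while a legacy caller still gets the full list back.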

But yes, the object type is obviously only two bits of information
(actually, considering the type distribution, probably just one bit),
and it's already encoded in the first hash, so it doesn't actually
help much as "collision avoidance" particularly once you have a
particular attack against that hash in place.

It's just that it *is* extra information that we already have, and
that is very natural to use once you start thinking of the hash lookup
as returning a list of objects. It also mitigates one of the worst
_confusions_ in git, and so mitigates the worst-case downside of an
attack basically for free, so it seems like a no-brainer.

But the real new piece of object version would be a truncated second
hash of the object.

I don't think it matters too much what that second hash is. I would
say that we'd just aim for approximately 256 bits of hash in total.

Since we already have basically 160 bits of fairly good hashing, and
roughly 128 bits of that isn't known to be attackable, we'd just use
another hash and truncate that to 128 bits. That would be *way*
overkill in practice, but maybe overkill is what we want. And it
wouldn't really expand the objects all that much more than just
picking a new 256-bit hash would do.

So you'd have to be able to attack both the full SHA1 _and_ whatever
other good hash we pick, truncated to 128 bits.
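
As a rough sketch of what such an object version could look like: the
SHA1 below follows the way git already hashes an object (a
"<type> <size>" header, a NUL byte, then the data). The second hash is
an assumption made only for this example: SHA-256 is picked arbitrarily,
and the object_version() helper and its 128-bit default are hypothetical,
not something the proposal specifies.

```python
import hashlib


def object_header(obj_type: str, data: bytes) -> bytes:
    # Git prefixes the content with "<type> <size>\0" before hashing.
    return f"{obj_type} {len(data)}".encode() + b"\0"


def git_sha1(obj_type: str, data: bytes) -> str:
    """The existing object name: SHA1 over header + data."""
    return hashlib.sha1(object_header(obj_type, data)).hexdigest() if False else \
        hashlib.sha1(object_header(obj_type, data) + data).hexdigest()


def object_version(obj_type: str, data: bytes, bits: int = 128) -> str:
    """Hypothetical second hash: SHA-256 (an assumption), truncated to `bits`."""
    digest = hashlib.sha256(object_header(obj_type, data) + data).digest()
    return digest[: bits // 8].hex()


data = b"hello world\n"
print(git_sha1("blob", data))        # the usual 160-bit object name
print(object_version("blob", data))  # 128 extra bits from the second hash
```

Anybody hashing the same data gets the same pair, which is the property
asked about at the top of this mail.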

                Linus

PS.  if people think that SHA1 is of a good _size_, and only worry
about the known weaknesses of the hashing itself, we'd only need to
get back the bits that the attacks take away from brute force. That's
currently the 80 -> ~63 bits attack, so you'd really only want about
40 bits of second hash to claw us back up to 80 bits of brute
force (again: brute force is basically sqrt() of the search space, so
half the bits, so adding 40 bits of hash adds 20 bits to the brute
force cost and you'd get back up to the 2**80 we started with).

So 128 bits of secondary hash really is much more than we'd need. 64
bits would probably be fine.
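
The arithmetic in the PS can be checked with a few lines. This only
encodes the rule of thumb used above (brute force costs about half the
hash bits, and extra hash bits add half their count on top of the
attack cost); it is a simplified model, not a cryptanalytic result, and
the function names are made up for the example:

```python
SHA1_BITS = 160
ATTACK_BITS = 63  # approximate collision attack cost quoted above


def brute_force_bits(hash_bits):
    """Brute-force collision cost in bits: sqrt() of the search space."""
    return hash_bits / 2


def effective_bits(second_hash_bits, attack_bits=ATTACK_BITS):
    """Attack cost plus what the truncated second hash adds (rule of thumb)."""
    return attack_bits + brute_force_bits(second_hash_bits)


print("plain SHA1 brute force:", brute_force_bits(SHA1_BITS))
for extra in (40, 64, 128):
    print(extra, "extra bits ->", effective_bits(extra))
```

Under this model 40 extra bits already land slightly above the original
80, and 64 or 128 bits leave a wide margin.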