On Thu, Mar 2, 2017 at 12:43 PM, Junio C Hamano <gitster@xxxxxxxxx> wrote:
>
> My reaction heavily depends on how that "object version" thing
> works.
>
> Would "object version" be like a truncated SHA-1 over the same data
> but with different IV or something, i.e. something that guarantees
> anybody would get the same result given the data to be hashed?

Yes, it does need to be that in practice.

So what I was thinking the object version would be is:

 (a) we actually take the object type into account explicitly.

 (b) we explicitly add another truncated hash.

The first part we can already do without any actual data structure
changes, since basically all users already know the type of an object
when they look it up. So we already have information that we could use
to narrow down the hash collision case if we saw one.

There are some (very few) cases where we don't already explicitly have
the object type (a tag reference can be any object, for example, and
existing scripts might ask "give me the type of this SHA1 object" with
"git cat-file -t"), but that just goes back to the whole "yeah, we'll
handle legacy uses" idea: we will look up objects even _without_ the
extra version data, so it actually integrates well into the whole
notion.

Basically, once you accept that "hey, we'll just have a list of objects
with that hash", it just makes sense to narrow it down by the object
type we also already have.

But yes, the object type is obviously only two bits of information
(actually, considering the type distribution, probably just one bit),
and it's already encoded in the first hash, so it doesn't actually help
much as "collision avoidance", particularly once you have a particular
attack against that hash in place. It's just that it *is* extra
information that we already have, and that is very natural to use once
you start thinking of the hash lookup as returning a list of objects.

It also avoids one of the worst _confusions_ in git, and so mitigates
the worst-case downside of an attack basically for free, so it seems
like a no-brainer.

But the real new piece of the object version would be a truncated
second hash of the object.

I don't think it matters too much what that second hash is; I would say
that we'd just approximate having a total of 256 bits of hash. Since we
already have basically 160 bits of fairly good hashing, and roughly 128
bits of that isn't known to be attackable, we'd just use another hash
and truncate that to 128 bits.

That would be *way* overkill in practice, but maybe overkill is what we
want. And it wouldn't really expand the objects all that much more than
just picking a new 256-bit hash would do.

So you'd have to be able to attack both the full SHA1 _and_ whatever
other different good hash to 128 bits.

                Linus

PS. If people think that SHA1 is of a good _size_, and only worry about
the known weaknesses of the hashing itself, we'd only need to get back
the bits that the attacks take away from brute force. That's currently
the 80 -> ~63 bit attack, so you'd really only want about 40 bits of
second hash to claw us back up to 80 bits of brute force (again: brute
force is basically sqrt() of the search space, so half the bits, so
adding 40 bits of hash adds 20 bits to the brute-force cost and you'd
get back up to the 2**80 we started with).

So 128 bits of secondary hash really is much more than we'd need. 64
bits would probably be fine.
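
[Editor's illustration of point (b) above, not from the mail and not a
patch: a minimal C sketch of keeping the existing SHA-1 name and adding
a second, independent hash truncated to 128 bits. The struct and
function names (object_name, name_object) are made up here, and
OpenSSL's SHA1()/SHA256() merely stand in for "whatever other different
good hash".]

#include <stddef.h>
#include <string.h>
#include <openssl/sha.h>

struct object_name {
	unsigned char sha1[SHA_DIGEST_LENGTH];	/* existing 160-bit SHA-1 name */
	unsigned char extra[16];		/* second hash, truncated to 128 bits */
};

static void name_object(const unsigned char *buf, size_t len,
			struct object_name *out)
{
	unsigned char full[SHA256_DIGEST_LENGTH];

	SHA1(buf, len, out->sha1);		/* the primary name stays as-is */
	SHA256(buf, len, full);			/* an independent second hash */
	memcpy(out->extra, full, sizeof(out->extra));	/* keep only the first 128 bits */
}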
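
[Editor's illustration of point (a) and the "lookup returns a list"
idea, again purely hypothetical: none of these names (lookup_by_sha1,
object_entry) are real git internals. Once a hash lookup may return a
list of objects, the caller's expected type naturally narrows it down.]

enum obj_type { OBJ_COMMIT, OBJ_TREE, OBJ_BLOB, OBJ_TAG };

struct object_entry {
	enum obj_type type;		/* the explicit type from point (a) */
	struct object_entry *next;	/* other objects stored under the same SHA-1 */
	/* ... object data, truncated second hash, etc. ... */
};

/* Hypothetical: returns the chain of all objects filed under this SHA-1. */
extern struct object_entry *lookup_by_sha1(const unsigned char *sha1);

static struct object_entry *lookup_typed(const unsigned char *sha1,
					 enum obj_type expected)
{
	struct object_entry *e;

	for (e = lookup_by_sha1(sha1); e; e = e->next)
		if (e->type == expected)
			return e;	/* colliding objects of other types are skipped */
	return NULL;			/* nothing of that type stored under this SHA-1 */
}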
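
[Editor's note: the back-of-the-envelope arithmetic from the PS,
written out with the same numbers, nothing new added.]

    full SHA1 brute-force collision:  2**(160/2) = 2**80
    best known shortcut attack:       ~2**63        (the "80 -> ~63" above)
    add 40 bits of second hash:       a collision must also match those
                                      40 bits; birthday-style that costs
                                      another 40/2 = 20 bits
    resulting cost:                   ~2**(63 + 20) = 2**83 >= 2**80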