On Fri, Feb 24, 2017 at 4:39 PM, Linus Torvalds <torvalds@xxxxxxxxxxxxxxxxxxxx> wrote:
>
> Honestly, I think that a primary goal for a new hash implementation
> absolutely needs to be to minimize mixing.
>
> Not for security issues, but because of combinatorics. You want to
> have a model that basically reads old data, but that very aggressively
> approaches "new data only" in order to avoid the situation where you
> have basically the exact same tree state, just _represented_
> differently.
>
> For example, what I would suggest the rules be is something like this:

Hmm. Having looked at this a fair amount, and in particular having looked at the code as part of the hash typesafety patch I did, I am actually starting to think that it would not be too painful at all to have a totally different approach, which might be a lot easier to do.

So bear with me, let me try to explain my thinking:

 (a) if we want to be backwards compatible and not force people to convert their trees on some flag day, we're going to be stuck with having to have the SHA1 code around and all the existing object parsing basically forever.

Now, this basically means that the _easy_ solution would be to just do the flag day, switch to SHA-256, extend everything to 32-byte hashes, and have a "git2 fast-import" that makes it easy to convert stuff. But it really would be a completely different version of git, with a new pack-file format and no real compatibility.

Such a flag-day approach would certainly have advantages: it would allow re-architecting some bad choices:

 - make the hashing be something that can be threaded (ie maybe we can just block it up in 4MB chunks that you can hash in parallel, and make the git object hash be the hash of hashes)

 - replace zlib with something like zstd

 - get rid of old pack formats etc.

but on the whole, I still think that the compatibility would be worth much more than the possible technical advantages of a clean-slate restart.

 (b) the SHA1 hash is actually still quite strong, and the collision detection code effectively means that we don't really have to worry about collisions in the immediate future.

In other words, the mitigation of the current attack is actually really easy technically (modulo perhaps the performance concerns), and there's still nothing fundamentally wrong with using SHA1 as a content hash. It's still a great hash.

Now, my initial reaction (since it's been discussed for so long anyway) was obviously "pick a different hash". That was everybody's initial reaction, I think.

But I'm starting to think that maybe that initial obvious reaction was wrong.

The thing that makes collision attacks so nasty is that our reaction to a collision is so deadly. But that's not necessarily fundamental: we certainly use hashes with collisions every day, and they work fine. And they work fine because the code that uses those hashes is designed to simply deal gracefully - although very possibly with a performance degradation - with two different things hashing to the same bucket.

So what if the solution to "SHA1 has potential collisions" is "any hash can have collisions in theory, let's just make sure we handle them gracefully"?

Because I'm looking at our data structures that have hashes in them, and many of them are actually of the type where I go "Hmm.. replacing the hash entirely is really really painful - but it wouldn't necessarily be all that nasty to extend the format to have additional version information".
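To make the concept concrete, here's roughly the kind of thing I mean - purely hypothetical names and layout, a sketch of the idea rather than actual git code:

	#include <stdint.h>
	#include <string.h>
	#include <stddef.h>

	#define SHA1_RAWSZ 20

	/*
	 * Object identity becomes the pair (hash, version) instead of
	 * the bare hash.  Old data simply has no version field, so
	 * readers would default it to zero.
	 */
	struct object_id_ext {
		unsigned char sha1[SHA1_RAWSZ];
		uint32_t version;	/* 0 for every pre-existing object */
	};

	static int oideq_ext(const struct object_id_ext *a,
			     const struct object_id_ext *b)
	{
		return !memcmp(a->sha1, b->sha1, SHA1_RAWSZ) &&
		       a->version == b->version;
	}

	/*
	 * Lookup sketch: a real index would binary-search to the first
	 * entry with the wanted SHA1 and probe forward from there (more
	 * on that further down); this toy version just scans linearly.
	 * The point is that a SHA1 collision merely means one extra
	 * comparison, not a broken repository.
	 */
	static const struct object_id_ext *
	find_object(const struct object_id_ext *objs, size_t nr,
		    const struct object_id_ext *key)
	{
		size_t i;

		for (i = 0; i < nr; i++)
			if (oideq_ext(&objs[i], key))
				return &objs[i];
		return NULL;
	}

Two objects that collide on SHA1 would then still compare as different, and everything that only knows about the 20-byte hash keeps working - it just sees version 0.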
And the advantage of that approach is that it actually makes the compatibility part trivial. No more "have to pick a different hash and two different formats", and more of a "just the same format, with extended information that might not be there for old objects".

So we have a few different types of formats:

 - the purely local data structures: the pack index file, the file index, our refs etc.

   These we could change completely, and it wouldn't even be all that painful. The pack index has already gone through several versions, and it doesn't affect anything else.

 - the actual objects.

   These are fairly painful to change, particularly things like the "tree" object, which is probably the worst designed of the lot. Making it contain a fixed-size binary thing was really a nasty mistake. My bad.

 - the pack format and the protocol to exchange "I have this" information.

   This is *really* painful to change, because it contains not just the raw object data, but it obviously ends up being the wire format for remote accesses.

and it turns out that *all* of these formats look like they would be fairly easy to extend to having extra object version information. Some of that extra object version information we already have and don't use, in fact.

Even the tree format, with the annoying fixed-size binary blob: yes, it has that fixed-size binary blob, but it has it in _addition_ to the ASCII textual form that would be really easy to just extend upon. We have that "tree entry type" that we've already made extensions with by using it for submodules. It would be quite easy to just say that a tree entry also has a "file version" field, so that you can have multiple objects that just hash to the same SHA1, and git wouldn't even *care*.

The transfer protocol is the same: yes, we transfer hashes around, but it would not be all that hard to extend it to "transfer hash and object version". And the difference is that then the "backwards compatibility" part just means interacting with somebody who didn't know to transfer the object version. So suddenly being backwards compatible isn't a whole different object parsing thing, it's just a small extension.

IOW, we could make it so that the SHA1 is just a hash into a list of objects. Even the pack index format wouldn't need to change - right now we assume that an index hit gives us the direct pointer into the pack file, but we *could* just make it mean that it gives us a direct pointer to the first object in the pack file with that SHA1 hash. Exactly like you'd normally use a hash table with linear probing.

Linear probing is usually considered a horrible approach to hash tables, but it's actually a really useful one for the case where collisions are very rare.

Anyway, I do have a suggestion for what the "object version" would be, but I'm not even going to mention it, because I want people to first think about the _concept_ and not the implementation.

So: What do you think about the concept?

              Linus