Re: [PATCH] Implement fast hash-collision detection

Bill Zaumen <bill.zaumen+git@xxxxxxxxx> · Wed, 30 Nov 2011 11:00:04 -0800

[I will reply later to Jeff's comment from last night, with some
clarifications and explanations.]

> What I'm thinking is whether it's possible to decouple two sha-1 roles
> in git, as object identifier and digest, separately. Each sha-1
> identifies an object and an extra set of digests on the "same" object.
> Object database is extended to store all these new digests and mapping
> between sha-1 and them. When we need to verify an object, given an
> sha-1, we rehash that object and check the result digest with the ones
> linked to the sha-1.

The patch I created (at least, a reasonable chunk of the code) more or
less does that.  I used a CRC primarily because I had the impression
that people were very concerned about speed, but it is easy to replace
the CRC with the message digest of your choice.  In any case, it might
be a good starting point if you want to try something in a different
direction.

Basically, when you create a loose object, in addition to getting a
SHA-1 ID, you get a message digest that gets stored as well (in a
separate file). When you index a pack file, you get an IDX file
containing the SHA-1 IDs plus a corresponding MDS file containing the
message digests. Index-pack calculates the SHA-1 value from the object
stored in the pack file, and the (additional) message digest is computed
at the same time using the same data.  Commands like verify-pack check
both the IDX file and the MDS file for consistency with the matching
pack file.  The new message digest (the CRC in the patch) is used only
when a repository is being altered (e.g., a loose object or pack file is
being created, or a fetch, push, or pull operation is running) or when
some explicit verification operation is running (e.g., git verify-pack).
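
To make the "computed at the same time using the same data" part
concrete, here is a minimal sketch in Python (not the C code from the
patch). The function name hash_and_crc is made up for illustration, and
it is only an assumption here that the CRC covers the Git object header
as well as the contents; the patch may feed the CRC different bytes.

```python
import hashlib
import zlib


def hash_and_crc(obj_type: str, data: bytes, chunk_size: int = 4096):
    """Return (sha1_hex, crc32) for one object, computed in a single pass.

    Hypothetical helper: the SHA-1 is the usual Git object ID, taken over
    "<type> <size>\\0" followed by the contents. Feeding the same bytes to
    the CRC is an assumption made for this sketch.
    """
    header = f"{obj_type} {len(data)}".encode() + b"\0"
    sha1 = hashlib.sha1(header)
    crc = zlib.crc32(header)
    # Each chunk is read once and handed to both digests.
    for start in range(0, len(data), chunk_size):
        chunk = data[start:start + chunk_size]
        sha1.update(chunk)
        crc = zlib.crc32(chunk, crc)
    return sha1.hexdigest(), crc & 0xFFFFFFFF


if __name__ == "__main__":
    object_id, crc = hash_and_crc("blob", b"hello\n")
    print(object_id, format(crc, "08x"))
```

Swapping the CRC for another message digest only means replacing the
two zlib.crc32 calls with a second hashlib object that is updated
alongside the SHA-1.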

Adding an additional header to the commit message is a good idea.  I had
actually tried that, but something went wrong; one of you suggested what
the problem might have been, and I can try again if there is some
interest in pursuing that.

It might be worth pointing out that you can use the SHA-1 hash of the
contents of objects (i.e., without the Git object header) as an
additional digest.  As a test, I took two 128-byte files with the
same MD5 hash, differing past the 20th byte, and deleted the first
four bytes of each.  With those bytes deleted, the hash collision
went away. I doubt there is a known efficient algorithm that can
generate a hash collision for two files and also for the two files
obtained by deleting N bytes from both.
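
For reference, the two digests differ only in whether the Git object
header is hashed in front of the contents. A small Python sketch (the
function names are made up for illustration and are not part of Git or
of the patch):

```python
import hashlib


def git_object_id(obj_type: str, data: bytes) -> str:
    """SHA-1 over "<type> <size>\\0" + contents, i.e. the usual object ID."""
    header = f"{obj_type} {len(data)}".encode() + b"\0"
    return hashlib.sha1(header + data).hexdigest()


def content_digest(data: bytes) -> str:
    """SHA-1 over the contents alone, with no Git object header."""
    return hashlib.sha1(data).hexdigest()


if __name__ == "__main__":
    data = b"hello\n"
    print("object ID:     ", git_object_id("blob", data))
    print("content digest:", content_digest(data))
```

An attacker would need a pair of contents that collide both with the
header prepended and without it, which is the situation the MD5 test
above was meant to probe.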
