Re: [PATCH] Implement fast hash-collision detection

Nguyen Thai Ngoc Duy <pclouds@xxxxxxxxx> · Wed, 30 Nov 2011 20:35:02 +0700

On Wed, Nov 30, 2011 at 3:59 AM, Jeff King <peff@xxxxxxxx> wrote:
> If you wanted to say "make a digest of all of the sub-objects pointed to
> by the tag", then yes, that does work (security-wise). But it's
> expensive to calculate. Instead, you want to use a "digest of digests"
> as much as possible. Which is what git already does, of course; you hash
> the tree object, which contains hashes of the blob sha1s. Git's
> conceptual model is fine. The only problem is that sha1 is potentially
> going to lose its security properties, weakening the links in the chain.
> So as much as possible, we want to insert additional links at the exact
> same places, but using a stronger algorithm.

What I'm thinking is whether it's possible to decouple the two roles
sha-1 plays in git: object identifier and digest. Each sha-1 would
identify an object plus an extra set of digests of the "same" object.
The object database would be extended to store these new digests and
the mapping between each sha-1 and them. When we need to verify an
object, given a sha-1, we rehash that object and check the resulting
digest against the ones linked to that sha-1.
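
To make the lookup-then-verify idea concrete, here is a rough sketch in
Python rather than git's C. Everything in it is hypothetical: the
ObjectStore class, its in-memory dicts, and the choice of sha256 as the
extra digest are assumptions for illustration, not anything git has. It
only covers objects hashed as-is (blobs).

```python
import hashlib


def frame(obj_type: str, payload: bytes) -> bytes:
    # Bytes git hashes for an object: "<type> <size>" + NUL + payload.
    return f"{obj_type} {len(payload)}".encode() + b"\0" + payload


class ObjectStore:
    # Hypothetical store: sha-1 is only the key; other digests do the checking.

    def __init__(self):
        self.objects = {}        # sha-1 hex -> (type, payload)
        self.extra_digests = {}  # sha-1 hex -> {algorithm name: hex digest}

    def add(self, obj_type: str, payload: bytes) -> str:
        raw = frame(obj_type, payload)
        oid = hashlib.sha1(raw).hexdigest()
        self.objects[oid] = (obj_type, payload)
        # sha256 is an assumed choice; any set of algorithms could go here.
        self.extra_digests[oid] = {"sha256": hashlib.sha256(raw).hexdigest()}
        return oid

    def verify(self, oid: str) -> bool:
        # Rehash the stored object and compare with every digest linked
        # to this sha-1.
        obj_type, payload = self.objects[oid]
        raw = frame(obj_type, payload)
        return all(
            hashlib.new(algo, raw).hexdigest() == expected
            for algo, expected in self.extra_digests[oid].items()
        )
```

If the content stored under a sha-1 is swapped for something else,
verify() returns False even though the identifier itself is unchanged.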

These new digests would be "digests of digests", just like how we use
sha-1. In fact the new digest algorithm could even be sha-1 itself. We
just don't hash trees/commits/tags as-is when computing these new
digests. When a tree object is hashed, for example, a new tree object
with the new digests as links is created for hashing (we keep the
sha-1 <-> new digest mapping on disk). Think of duplicating the entire
DAG with new digests as links instead of sha-1, then keeping the
digests and dropping the duplicated DAG.
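
A sketch of that "shadow tree" hashing, again in Python and again with
made-up names (Repo, digest(), the new_digest dict) and sha256 as an
assumed algorithm. Tree entries are sorted by plain name, which is
simpler than git's real ordering, and commits/tags are left out.

```python
import hashlib


def frame(obj_type: str, payload: bytes) -> bytes:
    # Bytes git hashes for an object: "<type> <size>" + NUL + payload.
    return f"{obj_type} {len(payload)}".encode() + b"\0" + payload


def tree_payload(entries, link) -> bytes:
    # entries: (mode, name, child sha-1 hex); link() yields the raw link bytes.
    return b"".join(
        f"{mode} {name}".encode() + b"\0" + link(child)
        for mode, name, child in entries
    )


class Repo:
    def __init__(self):
        self.blobs = {}       # sha-1 hex -> content
        self.trees = {}       # sha-1 hex -> sorted entries
        self.new_digest = {}  # sha-1 hex -> new digest (the on-disk mapping)

    def add_blob(self, data: bytes) -> str:
        oid = hashlib.sha1(frame("blob", data)).hexdigest()
        self.blobs[oid] = data
        return oid

    def add_tree(self, entries) -> str:
        entries = sorted(entries, key=lambda e: e[1])
        # The identifier is still sha-1 over sha-1 links, as git does today.
        payload = tree_payload(entries, bytes.fromhex)
        oid = hashlib.sha1(frame("tree", payload)).hexdigest()
        self.trees[oid] = entries
        return oid

    def digest(self, oid: str) -> str:
        # Computed on demand and remembered in the mapping.
        if oid in self.new_digest:
            return self.new_digest[oid]
        if oid in self.blobs:
            raw = frame("blob", self.blobs[oid])
        else:
            # Shadow tree: same entries, but links are the children's new
            # digests. It is hashed and then thrown away.
            shadow = tree_payload(
                self.trees[oid], lambda c: bytes.fromhex(self.digest(c))
            )
            raw = frame("tree", shadow)
        self.new_digest[oid] = hashlib.sha256(raw).hexdigest()
        return self.new_digest[oid]
```

Only the sha-1 -> digest mapping survives; the shadow trees never hit
the object database. A change to any blob changes the root's new digest
even if the sha-1 identifiers somehow stay the same.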

The day sha-1 is broken, a project can generate new digests from its
known-good old repo and require developers to use the new digests for
verification instead of sha-1. Git would still use sha-1 as the
identifier after that day.

Computing these digests is expensive, but for local, day-to-day use
we only need sha-1 as an identifier (correct me if I'm wrong here), so
we can delay computing and storing these new digests until we
absolutely need them (e.g. push/fetch). There is also an interesting
case: assuming these digests are strong enough, we could replace sha-1
as the identifier in git with a cheaper hash. The new hash must fit in
the 160-bit space that sha-1 takes and should have good distribution,
of course. Projects with a lot of data may like it that way.
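
Just to show what "fits in the 160-bit space" would look like, a tiny
Python sketch. BLAKE2b is picked here only because hashlib can emit a
20-byte digest directly; it is my assumption, not a proposal, and I'm
not claiming anything about its speed relative to sha-1. object_id() is
a hypothetical helper.

```python
import hashlib


def object_id(obj_type: str, payload: bytes) -> str:
    # Same "<type> <size>" + NUL framing git uses, different hash function.
    raw = f"{obj_type} {len(payload)}".encode() + b"\0" + payload
    # digest_size=20 gives 20 bytes = 160 bits, the width of a sha-1.
    return hashlib.blake2b(raw, digest_size=20).hexdigest()
```

The result is 40 hex characters, so it would slot into anything that
stores or prints a sha-1 today.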
-- 
Duy