Re: SHA1 collisions found

Jeff King <peff@xxxxxxxx> · Thu, 23 Feb 2017 17:43:02 -0500

On Thu, Feb 23, 2017 at 02:38:29PM -0800, Linus Torvalds wrote:

> > Thanks, I hadn't seen that yet. That doesn't look like it should be hard
> > to integrate into Git.
> 
> Here's a *very* ugly patch that is absolutely disgusting and should not be 
> used. But it does kind of work (I tested it with a faked-up extra patch 
> that made git accept the broken pdf as a loose object).
> 
> What do I mean by "kind of work"? It uses that ugly and slow checking 
> SHA1 routine from the collision detection project for the SHA1 object 
> verification, and it means that "git fsck" ends up being about twice as 
> slow as it used to be.

Heh. I was just putting the finishing touches on a similar patch. Mine
is much less gross, in that it actually just adds a new USE_SHA1DC knob
(instead of, say, BLK_SHA1).

Here are the timings I came up with:

  - compute sha1 over whole packfile
    before: 1.349s
     after: 5.067s
    change: +275%

  - rev-list --all
    before: 5.742s
     after: 5.730s
    change: -0.2%

  - rev-list --all --objects
    before: 33.257s
     after: 33.392s
    change: +0.4%

  - index-pack --verify
    before: 2m20s
     after: 5m43s
    change: +145%

  - git log --no-merges -10000 -p
    before: 9.532s
     after: 9.683s
    change: +1.5%

So overall the sha1 computation is about 3-4x slower. But of
course most operations do more than just sha1. Accessing
commits and trees isn't slowed at all (both the +/- changes
there are well within the run-to-run noise). Accessing the
blobs is a little slower, but mostly drowned out by the cost
of things like actually generating patches.

The most-affected operation is `index-pack --verify`, which
is essentially just computing the sha1 on every object. It's
a bit worse than twice as slow, which means every push and
every fetch is going to experience that.

> For example, I suspect we could use our (much cleaner) block-sha1 
> implementation and include just the ubc_check.c code with that, instead of 
> the truly ugly C sha1 implementation that the sha1collisiondetection 
> project uses. 
> 
> But to do that, somebody would have to really know how the unavoidable 
> bit conditions check works with the intermediate hashes. I have only a 
> "big picture" mental model of it (read: I'm not competent to do that).

Yeah. I started looking at that, but the ubc check happens after the
initial expansion. But AFAICT, block-sha1 mixes that expansion in with
the rest of the steps for efficiency. So perhaps somebody who really
understands sha1 and the new checks could figure it out, but I'm not at
all certain that adding it in wouldn't lose some of block-sha1's
efficiency (on top of the time to actually do the ubc check).

-Peff