Re: [PATCH 0/4] hash.h: support choosing a separate SHA-1 for non-cryptographic uses

Taylor Blau <me@xxxxxxxxxxxx> · Tue, 3 Sep 2024 15:47:39 -0400

On Mon, Sep 02, 2024 at 02:08:25PM +0000, brian m. carlson wrote:
> On 2024-09-01 at 16:03:15, Taylor Blau wrote:
> > This series adds a build-time knob to allow selecting an alternative
> > SHA-1 implementation for non-cryptographic hashing within Git, starting
> > with the `hashwrite()` family of functions.
> >
> > This series is the result of starting to roll out verbatim multi-pack
> > reuse within GitHub's infrastructure. I noticed that on larger
> > repositories, it is harder thus far to measure a CPU speed-up on clones
> > where multi-pack reuse is enabled.
> >
> > After some profiling, I noticed that we spend a significant amount of
> > time in hashwrite(), which is not all that surprising. But much of that
> > time is wasted in GitHub's infrastructure, since we are using the same
> > collision-detecting SHA-1 implementation to produce a trailing checksum
> > for the pack which does not need to be cryptographically secure.
>
> Hmm, I'm not sure this is the case.  Let's consider the case where SHA-1
> becomes as easy to collide as MD4, which requires less than 2 hash
> operations for a collision, in which case we can assume that it's
> trivial, because eventually we expect that will happen with advances in
> technology.

I'm not sure this attack is possible as you described.

We still run any packs through index-pack before landing them in
$GIT_DIR/objects/pack, and index-pack still uses the collision-detecting
SHA-1 implementation (if the repository uses SHA-1 and Git was compiled
with it).

So if I were a malicious attacker trying to compromise data on a forge,
I would have to first (a) know the name of some pack that I was trying
to collide, then (b) create a pack which collides with that one before
actually pushing it. (b) seems difficult to impossible to execute
(certainly today, maybe ever) because the attacker only controls the
object contents within the pack, but can't adjust the pack header,
object headers, compression, etc.

But even if the attacker could do all of that, the remote still needs to
index that pack, and while checksumming the pack, it would notice the
collision (or SHA-1 mismatch) and reject the pack by die()-ing either
way. (AFAICT, this all happens in
builtin/index-pack.c::parse_pack_objects()).

> So in that case, we believe that an attacker who knows what's in a pack
> file and can collide one or more of the objects can create another
> packfile with a different, colliding object and cause the pack contents
> to be the same.  Because we use the pack file hash as the name of the
> pack and we use rename(2), which ignores whether the destination exists,
> that means we have to assume that eventually an attacker will be able to
> overwrite one pack file with another with different contents without
> being detected simply by pushing a new pack into the repository.

Right... but I think we would die() before we attempt to rename() the
pack into place as above.

Thanks,
Taylor