Re: [PATCH 0/4] hash.h: support choosing a separate SHA-1 for non-cryptographic uses

"brian m. carlson" <sandals@xxxxxxxxxxxxxxxxxxxx> · Mon, 2 Sep 2024 14:08:25 +0000

On 2024-09-01 at 16:03:15, Taylor Blau wrote:
> This series adds a build-time knob to allow selecting an alternative
> SHA-1 implementation for non-cryptographic hashing within Git, starting
> with the `hashwrite()` family of functions.
> 
> This series is the result of starting to roll out verbatim multi-pack
> reuse within GitHub's infrastructure. I noticed that on larger
> repositories, it is harder thus far to measure a CPU speed-up on clones
> where multi-pack reuse is enabled.
> 
> After some profiling, I noticed that we spend a significant amount of
> time in hashwrite(), which is not all that surprising. But much of that
> time is wasted in GitHub's infrastructure, since we are using the same
> collision-detecting SHA-1 implementation to produce a trailing checksum
> for the pack which does not need to be cryptographically secure.

Hmm, I'm not sure this is the case.  Let's consider the case where SHA-1
becomes as easy to collide as MD4, which requires less than 2 hash
operations for a collision, in which case we can assume that it's
trivial, because eventually we expect that will happen with advances in
technology.

So in that case, we believe that an attacker who knows what's in a pack
file and can collide one or more of the objects can create another
packfile with a different, colliding object and cause the pack contents
to be the same.  Because we use the pack file hash as the name of the
pack and we use rename(2), which ignores whether the destination exists,
that means we have to assume that eventually an attacker will be able to
overwrite one pack file with another with different contents without
being detected simply by pushing a new pack into the repository.

Even if we assume that SHA-1 attacks only become as easy as MD5 attacks,
the RapidSSL exploits[0][1] demonstrate that an attacker can create
collisions based on predictable outputs even with imprecise feedback. We
know SHA-1 is quite weak, so it could actually be quite soon that
someone finds an improvement in attacking the algorithm.  Note that
computing 2^56 DES keys by brute force costs about $20 from cloud
providers[2], and SHA-1 provides only 2^61 collision security, so a
small improvement would probably make this pretty viable to attack on
major providers with dedicated hardware.

This is actually worse on some providers where the operations tend to be
single threaded.  In those situations, there is no nondeterminism from
threads to make packs slightly different, and thus it would be extremely
easy to create such collisions predictably based on what an appropriate
upstream version of Git does.

So I don't think we can accurately say that cryptographic security isn't
needed here.  If we need a unique name that an attacker cannot control
and there are negative consequences (such as data loss) from the
attacker being able to control the name, then we need cryptographic
security here, which would imply a collision-detecting SHA-1 algorithm.

We could avoid this problem if we used link(2) and unlink(2) to avoid
renaming over existing pack files, though, similar to loose objects.
We'd need to not use the new fast algorithm unless core.createObject is
set to "link", though.

[0] https://en.wikipedia.org/wiki/MD5
[1] https://www.win.tue.nl/hashclash/rogue-ca/
[2] https://crack.sh/
-- 
brian m. carlson (they/them or he/him)
Toronto, Ontario, CA
Attachment:
signature.asc

Description: PGP signature