Re: [PATCH 0/4] hash.h: support choosing a separate SHA-1 for non-cryptographic uses

Jeff King <peff@xxxxxxxx> · Thu, 5 Sep 2024 06:37:36 -0400

On Tue, Sep 03, 2024 at 03:47:39PM -0400, Taylor Blau wrote:

> > Hmm, I'm not sure this is the case.  Let's consider the case where SHA-1
> > becomes as easy to collide as MD4, which requires less than 2 hash
> > operations for a collision, in which case we can assume that it's
> > trivial, because eventually we expect that will happen with advances in
> > technology.
> 
> I'm not sure this attack is possible as you described.
> 
> We still run any packs through index-pack before landing them in
> $GIT_DIR/objects/pack, and index-pack still uses the collision-detecting
> SHA-1 implementation (if the repository uses SHA-1 and Git was compiled
> with it).

I agree this is not a problem with your series as it is, but I think in
the long run we'd want to switch over index-pack, too, for two reasons:

  1. It could benefit from the same speedups on the receiving side that
     your patches give on the sending side (though the difference is
     less noticeable there, because we're also hashing the expanded
     contents).

  2. We'll have to do so if switch to a non-cryptographic hash (as is
     discussed elsewhere in this thread), since the two obviously have
     to match.

So let's imagine for a moment that we make that change.

I don't think you can smuggle any duplicate objects this way. We'll
still index each individual object using sha1dc, so any attempts to
collide there will be caught. You'd need totally different objects that
are in a packfile that happens to hash to the same checksum. And then
since the receiver is the one computing the object id of those objects,
they won't match (modulo some racing, which I'll get to).

But I do think you could do a denial-of-service attack by corrupting the
receiving repo. Imagine that:

  1. I (somehow) know you have a pack with hash XYZ, and thus that
     you'll be storing objects/pack/pack-XYZ.pack.

  2. I generate a new, valid pack that contains 100 random objects
     (or enough to defeat transfer.unpackLimit). I mutate that objects'
     contents until my new pack also has hash XYZ.

  3. I send you that pack (e.g., via push if you're a server, or by
     manipulating you into fetch if you're a client). You index the
     pack, see that it should be called pack-XYZ.pack, and then rename()
     it on top of your existing file.

Now you've just lost access to all of those objects, and your repo is
broken.

We should be able to simulate this in practice. First, let's weaken the
"fast" sha1 such that it retains only the final two bits:

diff --git a/sha1/openssl.h b/sha1/openssl.h
index 1038af47da..f0d5c59c43 100644
--- a/sha1/openssl.h
+++ b/sha1/openssl.h
@@ -32,6 +32,8 @@ static inline void openssl_SHA1_Final(unsigned char *digest,
 {
 	EVP_DigestFinal_ex(ctx->ectx, digest, NULL);
 	EVP_MD_CTX_free(ctx->ectx);
+	memset(digest, 0, 19);
+	digest[19] &= 0x3;
 }
 
 static inline void openssl_SHA1_Clone(struct openssl_SHA1_CTX *dst,

Now we can build with OPENSSL_SHA1_FAST=1, and use the result (which I
call git.compile below) to repack some victim repo:

  # use regular Git to clone first
  git clone --no-local --bare /some/repo victim.git

  # and now use our weak hash to repack; you should end up with
  # pack-000000...003.pack or similar
  git.compile -C victim.git repack -ad

We can even fsck it, if we teach the fsck code to use our weak hash (as
an aside, I think it is weird that fsck has its own hash verification;
it should probably be relying on verify-pack for this, but I haven't dug
into this in a while and IIRC there are some complications):

diff --git a/pack-check.c b/pack-check.c
index e883dae3f2..1b80616c70 100644
--- a/pack-check.c
+++ b/pack-check.c
@@ -67,7 +67,7 @@ static int verify_packfile(struct repository *r,
 	if (!is_pack_valid(p))
 		return error("packfile %s cannot be accessed", p->pack_name);
 
-	r->hash_algo->init_fn(&ctx);
+	r->hash_algo->fast_init_fn(&ctx);
 	do {
 		unsigned long remaining;
 		unsigned char *in = use_pack(p, w_curs, offset, &remaining);
@@ -76,9 +76,9 @@ static int verify_packfile(struct repository *r,
 			pack_sig_ofs = p->pack_size - r->hash_algo->rawsz;
 		if (offset > pack_sig_ofs)
 			remaining -= (unsigned int)(offset - pack_sig_ofs);
-		r->hash_algo->update_fn(&ctx, in, remaining);
+		r->hash_algo->fast_update_fn(&ctx, in, remaining);
 	} while (offset < pack_sig_ofs);
-	r->hash_algo->final_fn(hash, &ctx);
+	r->hash_algo->fast_final_fn(hash, &ctx);
 	pack_sig = use_pack(p, w_curs, pack_sig_ofs, NULL);
 	if (!hasheq(hash, pack_sig, the_repository->hash_algo))
 		err = error("%s pack checksum mismatch",

And now let's teach index-pack to use the weak hash, too:

diff --git a/builtin/index-pack.c b/builtin/index-pack.c
index fd968d673d..de99405880 100644
--- a/builtin/index-pack.c
+++ b/builtin/index-pack.c
@@ -284,7 +284,7 @@ static void flush(void)
 	if (input_offset) {
 		if (output_fd >= 0)
 			write_or_die(output_fd, input_buffer, input_offset);
-		the_hash_algo->update_fn(&input_ctx, input_buffer, input_offset);
+		the_hash_algo->fast_update_fn(&input_ctx, input_buffer, input_offset);
 		memmove(input_buffer, input_buffer + input_offset, input_len);
 		input_offset = 0;
 	}
@@ -357,7 +357,7 @@ static const char *open_pack_file(const char *pack_name)
 		output_fd = -1;
 		nothread_data.pack_fd = input_fd;
 	}
-	the_hash_algo->init_fn(&input_ctx);
+	the_hash_algo->fast_init_fn(&input_ctx);
 	return pack_name;
 }
 
@@ -1202,9 +1202,9 @@ static void parse_pack_objects(unsigned char *hash)
 
 	/* Check pack integrity */
 	flush();
-	the_hash_algo->init_fn(&tmp_ctx);
-	the_hash_algo->clone_fn(&tmp_ctx, &input_ctx);
-	the_hash_algo->final_fn(hash, &tmp_ctx);
+	the_hash_algo->fast_init_fn(&tmp_ctx);
+	the_hash_algo->fast_clone_fn(&tmp_ctx, &input_ctx);
+	the_hash_algo->fast_final_fn(hash, &tmp_ctx);
 	if (!hasheq(fill(the_hash_algo->rawsz), hash, the_repository->hash_algo))
 		die(_("pack is corrupted (SHA1 mismatch)"));
 	use(the_hash_algo->rawsz);

Now the code is all set. We need to adjust two things in the victim
repo:

  # we could just send 100+ fake objects from the client, but setting this to
  # "1" makes our attack loop simpler
  git -C victim.git config transfer.unpackLimit 1

  # Pushing happens into a quarantine area. And then when we move the
  # files out of quarantine, we do it with finalize_object_file(), which
  # does the usual link collision detection. But:
  #
  #   1. That doesn't happen for fetches, which are done directly in
  #      the pack directory, so the rename() done by index-pack applies
  #      there.
  #
  #   2. If the link() doesn't work for any reason other than EEXIST,
  #      we'll retry it as a rename(). So depending on your filesystem,
  #      this might be triggerable even for push.
  #
  # To simulate the worst case, we'll manually force the push into
  # rename mode.
  git -C victim.git config core.createObject rename

And now we can attack, like so:

  git init evil
  cd evil
  while git.compile -C ../victim.git fsck; do
          ls -l ../victim.git/objects/pack/
          git.compile commit --allow-empty -m foo
          git.compile push ../victim.git HEAD:foo
  done
  ls -l ../victim.git/objects/pack/

It's random whether we'll collide, but since there are only 4
possibilities (from 2 bits), it happens within a couple attempts.
Obviously sha1 is stronger than that, but I think the point is to
prepare for a world where collisions are cheap and easy to produce
(whether because sha1 gets more broken, or because we start using a
non-cryptographic hash).

So I do think there are problems in this general path, even though your
patch is not (yet) exposing them (although it would be easy to do so
accidentally; I was actually moderately surprised that index-pack is not
relying on the hashfile API already).

Probably the solution is:

  - renaming packfiles into place should use finalize_object_file() to
    avoid collisions.  That happens for tmp-objdir migration already,
    but we should do it more directly in index-pack (and maybe even
    pack-objects). And possibly we should implement an actual
    byte-for-byte comparison if we think we saw a collision, rather than
    just assuming that the write was effectively a noopi (see the FIXME
    in that function). That becomes more important if the checksum gets
    more likely to collide accidentally (we essentially ignore the
    possibility that sha1 would ever do so).

  - possibly object_creation_mode should have a more strict setting that
    refuses to fall back to renames. Or alternatively, we should do our
    own check for existence when falling back to a rename() in
    finalize_object_file().


And finally, I mentioned races earlier. I didn't try to reproduce it,
but I suspect there could also be a race like:

  - you have pack-XYZ.pack and pack-XYZ.idx

  - attacker generates evil pack-XYZ.pack as above (and assuming we have
    no fixed anything as I suggested above, and they really can replace
    the receiver's copy).

  - at some moment we will have moved pack-XYZ.pack into place, but not
    yet the matching idx. So we'll have the old idx and the new pack. An
    object lookup at that moment could cause us to find the object using
    the old idx, but then get the data out of the new pack file,
    replacing real data with the attacker's data. It's a pretty small
    window, but probably possible with enough simultaneous reading
    processes. Not something you'd probably want to spend $40k
    generating a collision for, but if we used a weak enough checksum,
    then attempts become cheap.

So I think we really do need to address this to prefer local data. At
which point it should be safe to use a much weaker checksum. But IMHO it
is still worth having a "fast" SHA1. Even if the future is SHA256 repos
or xxHash pack trailers, older clients will still use SHA1.

-Peff