Re: [PATCH 0/4] hash.h: support choosing a separate SHA-1 for non-cryptographic uses

Jeff King <peff@xxxxxxxx> writes:

> Probably the solution is:
>
>   - renaming packfiles into place should use finalize_object_file() to
>     avoid collisions.  That happens for tmp-objdir migration already,
>     but we should do it more directly in index-pack (and maybe even
>     pack-objects). And possibly we should implement an actual
>     byte-for-byte comparison if we think we saw a collision, rather than
>     just assuming that the write was effectively a noop (see the FIXME
>     in that function). That becomes more important if the checksum gets
>     more likely to collide accidentally (we essentially ignore the
>     possibility that sha1 would ever do so).

Yes.  I somehow mistakenly thought that Taylor analyzed the code
path when brian raised the "we rename over, overwriting existing
files" issue, and that I included fixing it as one of the steps
necessary to safely switch the tail sum to a weaker and faster
hash, but after reading the thread again, it seems that I was
hallucinating X-<.  This needs to be corrected.
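
To make the byte-for-byte comparison concrete, here is a minimal
sketch (the helper name is made up; this is not Git's actual code)
of what resolving the FIXME in finalize_object_file() could look
like: when the destination already exists, compare the two files
instead of assuming the pending write was a harmless noop.

```python
import os

# Hypothetical sketch, not Git's actual implementation: compare two
# files byte for byte when a name collision is suspected.
def files_identical(path_a, path_b, chunk_size=8192):
    """Return True iff the two files have identical contents."""
    if os.path.getsize(path_a) != os.path.getsize(path_b):
        return False
    with open(path_a, "rb") as fa, open(path_b, "rb") as fb:
        while True:
            a = fa.read(chunk_size)
            b = fb.read(chunk_size)
            if a != b:
                return False  # same name, different bytes: collision
            if not a:
                return True   # both hit EOF with no mismatch
```

The size check up front lets the common non-collision case (a true
duplicate write) bail out of the expensive comparison only when the
lengths already agree.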

>   - possibly object_creation_mode should have a more strict setting that
>     refuses to fall back to renames. Or alternatively, we should do our
>     own check for existence when falling back to a rename() in
>     finalize_object_file().

True, too.
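
A sketch of the second option, doing our own existence check when
falling back to rename (the helper name and return values are
invented for illustration): link(2) fails with EEXIST where
rename(2) would silently replace the existing file, so prefer it
and only fall back with an explicit check.

```python
import errno
import os

# Hypothetical sketch: prefer link(2), which refuses to clobber,
# over rename(2), which silently replaces an existing file.
def finalize_with_existence_check(tmp_path, final_path,
                                  allow_rename_fallback=True):
    try:
        os.link(tmp_path, final_path)
        os.unlink(tmp_path)
        return "linked"
    except OSError as e:
        if e.errno == errno.EEXIST:
            return "exists"  # collision suspected; caller should compare
        if not allow_rename_fallback:
            raise  # the stricter object_creation_mode discussed above
    # rename() would clobber, so check first -- racy, but better than
    # blindly overwriting on filesystems without hard links.
    if os.path.exists(final_path):
        return "exists"
    os.replace(tmp_path, final_path)
    return "renamed"
```

The fallback check is inherently racy between exists() and
replace(), which is exactly why a mode that refuses the fallback
outright is attractive.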

>   - at some moment we will have moved pack-XYZ.pack into place, but not
>     yet the matching idx. So we'll have the old idx and the new pack. An
>     object lookup at that moment could cause us to find the object using
>     the old idx, but then get the data out of the new pack file,
>     replacing real data with the attacker's data. It's a pretty small
>     window, but probably possible with enough simultaneous reading
>     processes. Not something you'd probably want to spend $40k
>     generating a collision for, but if we used a weak enough checksum,
>     then attempts become cheap.

This reminds me why we changed the hash we use to name packfiles.
We used to use the hash of the sorted object names contained in the
pack, but that meant a (forced) repack of the sole pack of a fully
packed repository could create a packfile whose contents and object
layout differ from the original but whose name is the same, creating
this exact race against yourself without involving any evil
attacker.  These days we of course use the hash of the actual pack
data stream, which avoids this problem.
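
The difference between the two naming schemes can be shown with a
toy sketch (simplified stand-ins, not Git's real pack-writing code;
the object ids are the well-known "hello\n" and empty blobs):

```python
import hashlib

def old_pack_name(object_ids):
    """Old scheme: hash of the sorted object names the pack contains."""
    h = hashlib.sha1()
    for oid in sorted(object_ids):
        h.update(bytes.fromhex(oid))
    return h.hexdigest()

def new_pack_name(pack_bytes):
    """Current scheme: hash of the actual pack data stream."""
    return hashlib.sha1(pack_bytes).hexdigest()

# Two packs holding the same objects in different layouts (say, a
# forced repack chose different deltas) collide under the old scheme...
oids = ["ce013625030ba8dba906f756967f9e9ca394464a",
        "e69de29bb2d1d6434b8b29ae775ad8c2e48c5391"]
assert old_pack_name(oids) == old_pack_name(list(reversed(oids)))
# ...while the current scheme names each after its distinct byte
# stream, so repacking yourself cannot recreate the same name.
assert new_pack_name(b"PACK\x00layout-one") != new_pack_name(b"PACK\x00layout-two")
```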

It is funny to compare this with the reason why we switched how we
name individual objects very early in the project's history.  We
used to name an object after the hash value of the _compressed_
object header plus payload, but that obviously meant a different
compression level (or an improvement in the compressor
implementation) would give different names to the same contents,
which is why we swapped the order to hash the object header plus
payload before compression.  Of course, that _requires_ us to avoid
overwriting an existing file to foil collision attacks.  That brings
us back to the topic here ;-).
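
The current scheme fits in a few lines (a sketch using hashlib, not
Git's code; the blob hash below is the well-known one for "hello\n"):

```python
import hashlib
import zlib

def object_name(obj_type, payload):
    """Git names an object by hashing '<type> <size>\\0' + payload
    *before* compression, so the name does not depend on how well
    the loose object is later deflated."""
    header = f"{obj_type} {len(payload)}".encode() + b"\0"
    return hashlib.sha1(header + payload).hexdigest()

blob = b"hello\n"
# Same value `git hash-object --stdin` prints for "hello\n":
assert object_name("blob", blob) == "ce013625030ba8dba906f756967f9e9ca394464a"

# Compressing at different levels may change the on-disk bytes of the
# loose object, but never its name, because the name was taken before
# compression -- exactly the property the early compressed-first
# scheme lacked.
loose_fast = zlib.compress(b"blob 6\0" + blob, 1)
loose_best = zlib.compress(b"blob 6\0" + blob, 9)
```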

> So I think we really do need to address this to prefer local data. At
> which point it should be safe to use a much weaker checksum. But IMHO it
> is still worth having a "fast" SHA1. Even if the future is SHA256 repos
> or xxHash pack trailers, older clients will still use SHA1.

Yup. 100% agreed.

Thanks.
