Re: [PATCH v4 3/8] finalize_object_file(): implement collision check

Junio C Hamano <gitster@xxxxxxxxx> · Tue, 24 Sep 2024 14:32:21 -0700

Taylor Blau <me@xxxxxxxxxxxx> writes:

> One of the reasons we never implemented this is because the files it
> moves are all named after the cryptographic hash of their contents
> (either loose objects, or packs which have their hash in the name these
> days).

So we are saying that both loose objects and packfiles would benefit
if we did collision checks here.

> ... So in preparation, let's actually
> implement a byte-for-byte collision check.

But the byte-for-byte collision check is only good for packfiles.

In other words, the definition of "content" that gets hashed to
derive the name can differ among csum-file users, so the way to
check for collision can be written for these different types.

> Note that this may cause some extra computation when the files are in
> fact identical, but this should happen rarely.
>
> Loose objects are exempt from this check, and the collision check may be
> skipped by calling the _flags variant of this function with the
> FOF_SKIP_COLLISION_CHECK bit set. This is done for a couple of reasons:
>
>   - We don't treat the hash of the loose object file's contents as a
>     checksum, since the same loose object can be stored using different
>     bytes on disk (e.g., when adjusting core.compression, using a
>     different version of zlib, etc.).

That's a good explanation why byte-for-byte check is inappropriate.

>   - We already use the path of the loose object as its hash value /
>     object name, so checking for collisions at the content level doesn't
>     add anything.

"We already ... doesn't add anything" -> "The point of collision
check is to find an attempt to store different contents that hashes
down to the same object name, and we continue to use sha1dc to hash
the content name so such a collision would have already been
detected".

>     This is why we do not bother to check the inflated object contents
>     for collisions either, since either (a) the object contents have the
>     fingerprint of a SHA-1 collision, in which case the collision
>     detecting SHA-1 implementation used to hash the contents to give us
>     a path would have already rejected it, or (b) the contents are part
>     of a colliding pair which does not bear the same fingerprints of
>     known collision attacks, in which case we would not have caught it
>     anyway.

I do not understand (b).  If you compare inflated contents, you
would find such a pair that sha1dc missed but the "implement
collision check" patch would have caught as a pair of byte sequences
that hash to the same value.  Isn't it an opportunity to make it safer
to implement such a loose object collision check?

Thanks.