Re: [PATCH v3 3/9] finalize_object_file(): implement collision check

Taylor Blau <me@xxxxxxxxxxxx> · Mon, 16 Sep 2024 12:03:05 -0400

On Mon, Sep 16, 2024 at 11:54:14AM -0400, Taylor Blau wrote:
> On Mon, Sep 16, 2024 at 12:45:38PM +0200, Patrick Steinhardt wrote:
> > This function compares the exact contents, but isn't that wrong? The
> > contents may differ even though the object is the same because the
> > object hash is derived from the uncompressed data, whereas we store
> > compressed data on disk.
> >
> > So this might trigger when you have different zlib versions, but also if
> > you configure core.compression differently.
>
> Oh, shoot -- you're right for when we call finalize_object_file() on
> loose objects.
>
> For packfiles and other spots where we use the hashfile API, the
> name of the file depends on the checksum'd contents, so we're safe
> there. But for loose objects the, the name of the file is based on the
> object hash, which is the hash of the uncompressed contents.
>
> So I think we would need something like this on top:

Thinking about this a little more, I think that most cases should
actually be OK. Of course, this only affects repositories that are
changing their zlib version, and/or changing their core.compression
setting regularly, so this is all pretty niche, but still...

We often guard calls to write_loose_object() with either calling
freshen_loose_object(), or freshen_packed_object(), which will update
the mtime of the containing pack or loose object path for a given hash.

So if we already have a loose object with hash X in the object store,
and we try to write that same object again with a different zlib
version and/or core.compression setting, we'll simply call
freshen_loose_object() and optimize out the actual object write.

Of course, that's not always the case. We might e.g., force an object
loose via loosen_packed_objects() which could encounter a TOCTOU race
with an incoming object write using a different compression setting. But
that seems exceedingly rare to me.

I think that we should still take that change in the mail I'm replying
to, but I would definitely appreciate others' thoughts here.

Thanks,
Taylor