Re: [PATCH] object-file: fix race in object collision check

Jeff King <peff@xxxxxxxx> · Mon, 30 Dec 2024 20:42:20 -0500

On Mon, Dec 30, 2024 at 03:50:04PM +0100, Patrick Steinhardt wrote:

> On Mon, Dec 30, 2024 at 06:40:53AM -0800, Junio C Hamano wrote:
> > Patrick Steinhardt <ps@xxxxxx> writes:
> > 
> > > By definition, two files cannot collide with each other when one of them
> > > has been removed. We can thus trivially fix the issue by ignoring ENOENT
> > > when opening either of the files we're about to check for collision.
> > 
> > Thanks for digging it down to the cause.
> > 
> > It is more like even if these two files collided (i.e. have the same
> > name based on what the hash function says, with different contents),
> > when one of them has been removed, we have no way to check if the
> > collision is benign, and even if it were not, we cannot do anything
> > about it, isn't it?
> 
> Depends on what "benign" means in this context, I guess. We can only
> assert the most trivial case of it being "benign", namely that we have
> computed a packfile that actually is the exact same. This is also going
> to be the most common case, as everything else would depend on a
> cryptographic collision of the packfile contents. And in that case... we
> cannot do anything about it, yes.

There is one gotcha here, though. We call this collision check only if
we got EEXIST trying to move the tempfile into place. If the destination
file then goes away, we can't do the collision check. But is it right to
quietly return success?

If the contents of the two were the same, that's fine. We don't need the
extra copy.

But if the contents were not the same, we'd prefer either to actually
copy the contents into place, or to return an error.

Of course we can't know, because the destination file has gone away. In
the common case they will be the same, but the whole point of this check
is to allow loosening the cryptographic collision of the packfile
contents. So the safest thing would be to retain the tempfile, copying
it into the destination file. That errs on the side of keeping data when
we cannot make a determination.

IOW, if we see ENOENT on filename_b, should we then loop back in the
caller to try the link() again?

> > I do like the simplicity of the solution.  I wonder given bad enough
> > race, we could fall into a case where both files are missing?
> 
> I was wondering about that, too, but it would very much feel like a bug
> to me if that were ever to happen. So I briefly considered whether I
> should treat the passed-in filenames differently: 
> 
>   - One that must exist non-racily. This is our temporary object or
>     packfile that we want to move into place.
> 
>   - And one that may have been removed racily. This is our target file
>     path that we want to overwrite, unless there is a collision.
> 
> The idea would be to only handle ENOENT for the second case. But in the
> end I don't think it's worth the complexity because `check_collision()`
> is used before rename(3p)ing the former into place, and that function
> would already notice ENOENT anyway. So we would eventually just die the
> same.

I think check_collision() is used _after_ the attempt to rename() into
place. So there's a race when the tempfile goes away, but I think the
outcome is made a bit worse by your patch.

Consider a sequence like this:

  a. Process A writes tmp_pack_foo.

  b. Process A tries to link tmp_pack_foo to pack-<hash> but finds it
     already exists.

  c. Process A opens both tmp_pack_foo and pack-<hash>.

  d. Process A compares the two byte-for-byte, and then returns
     success/failure based on whether they were actually identical.

Now imagine there is a process B that deletes the file (maybe an
over-zealous "gc --prune=now" deletes the in-use temporary file):

 - if process B deletes it between steps (a) and (b), process A returns
   an error (there is nothing to link). The caller knows that the data
   was not stored.

 - if process B deletes it between (b) and (c), then before your patch
   we see an error (because we can't compare the files). After your
   patch, we continue on and return success. The caller knows the data
   was stored (via the original file, not our new copy).

 - if process B deletes it between (c) and (d), then process A has no
   idea. But at this point it does not matter. If the files were
   identical, we return success (and in fact, process A deletes the file
   itself). And if not identical, then we return error, and the callers
   knows the data was not stored.

So even though the exact behavior may depend on where we hit the race, I
think ignoring an ENOENT open() error on the tempfile meaningfully
changes what happens in the middle case.

In practice I don't really expect this to happen, and "gc --prune=now"
is inherently risky in a live repository. But I think we're probably
better off to continue treating it as an error if we can't open our own
tempfile.

-Peff