Re: [PATCH] object-file: fix race in object collision check

Jeff King <peff@xxxxxxxx> · Wed, 1 Jan 2025 13:19:52 -0500

On Wed, Jan 01, 2025 at 08:50:51AM -0800, Junio C Hamano wrote:

> Jeff King <peff@xxxxxxxx> writes:
> 
> > There is one gotcha here, though. We call this collision check only if
> > we got EEXIST trying to move the tempfile into place. If the destination
> > file then goes away, we can't do the collision check. But is it right to
> > quietly return success?
> >
> > If the contents of the two were the same, that's fine. We don't need the
> > extra copy.
> >
> > But if the contents were not the same, we'd prefer either to actually
> > copy the contents into place, or to return an error.
> >
> > Of course we can't know, because the destination file has gone away. In
> > the common case they will be the same, but the whole point of this check
> > is to allow loosening the cryptographic collision of the packfile
> > contents. So the safest thing would be to retain the tempfile, copying
> > it into the destination file. That errs on the side of keeping data when
> > we cannot make a determination.
> >
> > IOW, if we see ENOENT on filename_b, should we then loop back in the
> > caller to try the link() again?
> 
> Yuck, I think you're absolutely right.

So I think this part, if adjusted as I suggested, would fix the race in
the tests without making anything worse (it's just more code).

And then this...

> > I think check_collision() is used _after_ the attempt to rename() into
> > place. So there's a race when the tempfile goes away, but I think the
> > outcome is made a bit worse by your patch.
> >
> > Consider a sequence like this:
> >
> >   a. Process A writes tmp_pack_foo.
> >
> >   b. Process A tries to link tmp_pack_foo to pack-<hash> but finds it
> >      already exists.
> >
> >   c. Process A opens both tmp_pack_foo and pack-<hash>.
> >
> >   d. Process A compares the two byte-for-byte, and then returns
> >      success/failure based on whether they were actually identical.
> >
> > Now imagine there is a process B that deletes the file (maybe an
> > over-zealous "gc --prune=now" deletes the in-use temporary file):
> >
> >  - if process B deletes it between steps (a) and (b), process A returns
> >    an error (there is nothing to link). The caller knows that the data
> >    was not stored.
> >
> >  - if process B deletes it between (b) and (c), then before your patch
> >    we see an error (because we can't compare the files). After your
> >    patch, we continue on and return success. The caller knows the data
> >    was stored (via the original file, not our new copy).
> >
> >  - if process B deletes it between (c) and (d), then process A has no
> >    idea. But at this point it does not matter. If the files were
> >    identical, we return success (and in fact, process A deletes the file
> >    itself). And if not identical, then we return error, and the callers
> >    knows the data was not stored.
> >
> > So even though the exact behavior may depend on where we hit the race, I
> > think ignoring an ENOENT open() error on the tempfile meaningfully
> > changes what happens in the middle case.
> >
> > In practice I don't really expect this to happen, and "gc --prune=now"
> > is inherently risky in a live repository. But I think we're probably
> > better off to continue treating it as an error if we can't open our own
> > tempfile.
> 
> So we'd ignore the racy and flaky tests, as hiding the flake by
> ignoring the error would only hurt the real world users.

...is all about ignoring ENOENT on the tmpfile itself. And I think we
can just drop that hunk entirely. The tests do not care here (they are
running simultaneous gc, but _not_ a simultaneous "--prune=now" gc).

-Peff