On Mon, Dec 30, 2024 at 03:50:04PM +0100, Patrick Steinhardt wrote: > On Mon, Dec 30, 2024 at 06:40:53AM -0800, Junio C Hamano wrote: > > Patrick Steinhardt <ps@xxxxxx> writes: > > > > > By definition, two files cannot collide with each other when one of them > > > has been removed. We can thus trivially fix the issue by ignoring ENOENT > > > when opening either of the files we're about to check for collision. > > > > Thanks for digging it down to the cause. > > > > It is more like even if these two files collided (i.e. have the same > > name based on what the hash function says, with different contents), > > when one of them has been removed, we have no way to check if the > > collision is benign, and even if it were not, we cannot do anything > > about it, isn't it? > > Depends on what "benign" means in this context, I guess. We can only > assert the most trivial case of it being "benign", namely that we have > computed a packfile that actually is the exact same. This is also going > to be the most common case, as everything else would depend on a > cryptographic collision of the packfile contents. And in that case... we > cannot do anything about it, yes. There is one gotcha here, though. We call this collision check only if we got EEXIST trying to move the tempfile into place. If the destination file then goes away, we can't do the collision check. But is it right to quietly return success? If the contents of the two were the same, that's fine. We don't need the extra copy. But if the contents were not the same, we'd prefer either to actually copy the contents into place, or to return an error. Of course we can't know, because the destination file has gone away. In the common case they will be the same, but the whole point of this check is to allow loosening the cryptographic collision of the packfile contents. So the safest thing would be to retain the tempfile, copying it into the destination file. That errs on the side of keeping data when we cannot make a determination. IOW, if we see ENOENT on filename_b, should we then loop back in the caller to try the link() again? > > I do like the simplicity of the solution. I wonder given bad enough > > race, we could fall into a case where both files are missing? > > I was wondering about that, too, but it would very much feel like a bug > to me if that were ever to happen. So I briefly considered whether I > should treat the passed-in filenames differently: > > - One that must exist non-racily. This is our temporary object or > packfile that we want to move into place. > > - And one that may have been removed racily. This is our target file > path that we want to overwrite, unless there is a collision. > > The idea would be to only handle ENOENT for the second case. But in the > end I don't think it's worth the complexity because `check_collision()` > is used before rename(3p)ing the former into place, and that function > would already notice ENOENT anyway. So we would eventually just die the > same. I think check_collision() is used _after_ the attempt to rename() into place. So there's a race when the tempfile goes away, but I think the outcome is made a bit worse by your patch. Consider a sequence like this: a. Process A writes tmp_pack_foo. b. Process A tries to link tmp_pack_foo to pack-<hash> but finds it already exists. c. Process A opens both tmp_pack_foo and pack-<hash>. d. Process A compares the two byte-for-byte, and then returns success/failure based on whether they were actually identical. Now imagine there is a process B that deletes the file (maybe an over-zealous "gc --prune=now" deletes the in-use temporary file): - if process B deletes it between steps (a) and (b), process A returns an error (there is nothing to link). The caller knows that the data was not stored. - if process B deletes it between (b) and (c), then before your patch we see an error (because we can't compare the files). After your patch, we continue on and return success. The caller knows the data was stored (via the original file, not our new copy). - if process B deletes it between (c) and (d), then process A has no idea. But at this point it does not matter. If the files were identical, we return success (and in fact, process A deletes the file itself). And if not identical, then we return error, and the callers knows the data was not stored. So even though the exact behavior may depend on where we hit the race, I think ignoring an ENOENT open() error on the tempfile meaningfully changes what happens in the middle case. In practice I don't really expect this to happen, and "gc --prune=now" is inherently risky in a live repository. But I think we're probably better off to continue treating it as an error if we can't open our own tempfile. -Peff