On 8/10/23 5:47 PM, Linus Torvalds wrote: > On Thu, 10 Aug 2023 at 15:39, Darrick J. Wong <djwong@xxxxxxxxxx> wrote: >> >> FWIW I recently fixed all my stupid debian package dependencies so that >> I could actually install liburing again, and rebuilt fstests. The very >> next morning I noticed a number of new test failures in /exactly/ the >> way that Kent said to expect: >> >> fsstress -d /mnt & <sleep then simulate fs crash>; \ >> umount /mnt; mount /dev/sda /mnt >> >> Here, umount exits before the filesystem is really torn down, and then >> mount fails because it can't get an exclusive lock on the device. > > I agree that that obviously sounds like mount is just returning either > too early. Or too eagerly. > > But I suspect any delayed fput() issues (whether from aio or io_uring) > are then just a way to trigger the problem, not the fundamental cause. > > Because even if the fput() is delayed, the mntput() part of that > delayed __fput action is the one that *should* have kept the > filesystem mounted until it is no longer busy. > > And more importantly, having some of the common paths synchronize > *their* fput() calls only affects those paths. > > It doesn't affect the fundamental issue that the last fput() can > happen in odd contexts when the file descriptor was used for something > a bit stranger. > > So I do feel like the fput patch I saw looked more like a "hide the > problem" than a real fix. The fput patch was not pretty, nor is it needed. What happens on the io_uring side is that pending requests (which can hold files referenced) are canceled on exit. But we don't wait for the references to go away, which then introduces this race. I've used this to trigger it: #!/bin/bash DEV=/dev/nvme0n1 MNT=/data ITER=0 while true; do echo loop $ITER sudo mount $DEV $MNT fio --name=test --ioengine=io_uring --iodepth=2 --filename=$MNT/foo --size=1g --buffered=1 --overwrite=0 --numjobs=12 --minimal --rw=randread --thread=1 --output=/dev/null & Y=$(($RANDOM % 3)) X=$(($RANDOM % 10)) VAL="$Y.$X" sleep $VAL ps -e | grep fio > /dev/null 2>&1 while [ $? -eq 0 ]; do killall -9 fio > /dev/null 2>&1 wait > /dev/null 2>&1 ps -e | grep "fio " > /dev/null 2>&1 done sudo umount /data if [ $? -ne 0 ]; then break fi ((ITER++)) done and can make it happen pretty easily, within a few iterations. Contrary to how it was otherwise presented in this thread, I did take a look at this a month ago and wrote up some patches for it. Just rebased them on the current tree: https://git.kernel.dk/cgit/linux/log/?h=io_uring-exit-cancel Since we have task_work involved for both the completions and the __fput(), ordering is a concern which is why it needs a bit more effort than just the bare bones stuff. The way the task_work list works, we llist_del_all() and run all items. But we do encapsulate that in io_uring anyway, so it's possible to run our pending local items and avoid that particular snag. WIP obviously, the first 3-4 prep patches were posted earlier today, but I'm not happy with the last 3 yet in the above branch. Or at least not fully confident, so will need a bit more thinking and testing. Does pass the above test case, and the regular liburing test/regression cases, though. -- Jens Axboe