On 6/28/23 4:13 PM, Kent Overstreet wrote:
> On Wed, Jun 28, 2023 at 03:17:43PM -0600, Jens Axboe wrote:
>> Case in point, just changed my reproducer to use aio instead of
>> io_uring. Here's the full script:
>>
>> #!/bin/bash
>>
>> DEV=/dev/nvme1n1
>> MNT=/data
>> ITER=0
>>
>> while true; do
>> 	echo loop $ITER
>> 	sudo mount $DEV $MNT
>> 	fio --name=test --ioengine=aio --iodepth=2 --filename=$MNT/foo \
>> 		--size=1g --buffered=1 --overwrite=0 --numjobs=12 \
>> 		--minimal --rw=randread --output=/dev/null &
>> 	Y=$(($RANDOM % 3))
>> 	X=$(($RANDOM % 10))
>> 	VAL="$Y.$X"
>> 	sleep $VAL
>> 	ps -e | grep fio > /dev/null 2>&1
>> 	while [ $? -eq 0 ]; do
>> 		killall -9 fio > /dev/null 2>&1
>> 		echo will wait
>> 		wait > /dev/null 2>&1
>> 		echo done waiting
>> 		ps -e | grep "fio " > /dev/null 2>&1
>> 	done
>> 	sudo umount /data
>> 	if [ $? -ne 0 ]; then
>> 		break
>> 	fi
>> 	((ITER++))
>> done
>>
>> and if I run that, it fails on the first umount attempt in that loop:
>>
>> axboe@m1max-kvm ~> bash test2.sh
>> loop 0
>> will wait
>> done waiting
>> umount: /data: target is busy.
>>
>> So yeah, this is _nothing_ new. I really don't think trying to address
>> this in the kernel is the right approach, it'd be a lot saner to
>> harden the xfstests side to deal with the umount a bit more sanely.
>> There are obviously tons of other ways that a mount could get pinned,
>> which isn't too relevant here since the bdev and mount point are
>> basically exclusive to the test being run. But the kill and delayed
>> fput are enough to make that case, imho.
>
> Uh, count me very much not in favor of hacking around bugs elsewhere.
>
> Al, do you know if this has been considered before? We've got fput()
> being called from aio completion, which often runs out of a workqueue
> (if not a workqueue, a bottom half of some sort - what happens then, I
> wonder) - so the effect is that it goes on the global list, not the
> task work list.
>
> Hence, kill -9'ing a process doing aio (or io_uring io, for extra
> reasons) causes umount to fail with -EBUSY.
>
> And since there's no mechanism for userspace to deal with this besides
> sleep and retry, this seems pretty gross.

But there is, as Christian outlined. I would not call it pretty or
intuitive, but you can in fact make it work just fine, and not just for
the deferred fput() case but also in the presence of other kinds of
pins. Of which there are of course many.

> I'd be willing to tackle this for aio since I know that code...

But it's not aio (or io_uring or whatever), it's simply the fact that
doing an fput() from an exiting task (for example) will end up being
done async. And hence waiting for task exits is NOT enough to ensure
that all file references have been released. Since there are a variety
of other reasons why a mount may be pinned and fail to umount, perhaps
it's worth considering that changing this behavior won't buy us that
much. Especially since it's been around for more than 10 years:

commit 4a9d4b024a3102fc083c925c242d98ac27b1c5f6
Author: Al Viro <viro@xxxxxxxxxxxxxxxxxx>
Date:   Sun Jun 24 09:56:45 2012 +0400

    switch fput to task_work_add

though that commit message goes on to read:

    We are guaranteed that __fput() will be done before we return to
    userland (or exit). Note that for fput() from a kernel thread we
    get an async behaviour; it's almost always OK, but sometimes you
    might need to have __fput() completed before you do anything else.
    There are two mechanisms for that - a general barrier
    (flush_delayed_fput()) and explicit __fput_sync(). Both should be
    used with care (as was the case for fput() from kernel threads all
    along). See comments in fs/file_table.c for details.

where that first sentence isn't true if the task is indeed exiting. I
guess you can say that it is, since an exiting task never returns to
userland, but that's splitting hairs. The commit in question doesn't
seem to handle that case either, though I assume that came in with a
later fixup. The sentence is true whenever the task_work does get
added, as task_work is always run before the task returns to userspace.

If a case were to be made that we also guarantee that the fput has been
done by the time the task returns to userspace, or exits, then we'd
probably want to move that deferred fput list to the task_struct, and
ensure that it gets run if the task exits, rather than have a global
deferred list. Currently we have:

1) If kthread or in interrupt:
   1a) add to the global fput list
2) Otherwise, task_work_add(). If that fails, goto 1a.

which would then become:

1) If kthread or in interrupt:
   1a) add to the global fput list
2) Otherwise, task_work_add(). If that fails, we know the task is
   exiting: add to a per-task defer list, to be run at a convenient
   time before the task has exited.
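For reference, the current dispatch looks roughly like this - a
paraphrase of fput() in fs/file_table.c from memory, not the verbatim
source - with a comment marking where the per-task list would slot in:

void fput(struct file *file)
{
	if (atomic_long_dec_and_test(&file->f_count)) {
		struct task_struct *task = current;

		/* 2) Normal task context: queue task_work, which is run
		 * before the task returns to userspace. */
		if (likely(!in_interrupt() && !(task->flags & PF_KTHREAD))) {
			init_task_work(&file->f_rcuhead, ____fput);
			if (!task_work_add(task, &file->f_rcuhead, TWA_RESUME))
				return;
			/* task_work_add() only fails once the task has
			 * passed exit_task_work(), ie it's exiting. Today
			 * we fall through to 1a below; the proposal would
			 * instead put the file on a per-task defer list
			 * that's guaranteed to be flushed before the task
			 * is gone. */
		}

		/* 1a) kthread or interrupt context (or an exiting task):
		 * punt to the global delayed list, flushed from a
		 * workqueue at some later point. */
		if (llist_add(&file->f_llist, &delayed_fput_list))
			schedule_delayed_work(&delayed_fput_work, 1);
	}
}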
That seems a lot saner than hacking around this in umount specifically.

-- 
Jens Axboe