On 6/28/23 2:44 PM, Jens Axboe wrote:
> On 6/28/23 11:52 AM, Kent Overstreet wrote:
>> On Wed, Jun 28, 2023 at 10:57:02AM -0600, Jens Axboe wrote:
>>> I discussed this with Christian offline. I have a patch that is pretty
>>> simple, but it does mean that you'd wait for delayed fput flush off
>>> umount. Which seems kind of iffy.
>>>
>>> I think we need to back up a bit and consider if the kill && umount
>>> really is sane. If you kill a task that has open files, then any fput
>>> from that task will end up being delayed. This means that the umount
>>> may very well fail.
>>>
>>> It'd be handy if we could have umount wait for that to finish, but I'm
>>> not at all confident this is a sane solution for all cases. And as
>>> discussed, we have no way to even identify which files we'd need to
>>> flush out of the delayed list.
>>>
>>> Maybe the test case just needs fixing? Christian suggested lazy/detach
>>> umount and wait for sb release. There's an fsnotify hook for that,
>>> fsnotify_sb_delete(). Obviously this is a bit more involved, but it
>>> seems to me that this would be the way to make it more reliable when
>>> killing tasks with open files is involved.
>>
>> No, this is a real breakage. Any time we introduce unexpected
>> asynchrony there's the potential for breakage: case in point, there was
>> a filesystem that made rm asynchronous, then there were scripts out
>> there that deleted until df showed under some threshold.. whoops...
>
> This is nothing new - any fput done from an exiting task will end up
> being deferred. The window may be a bit wider now or a bit different,
> but it's the same window. If an application assumes it can kill && wait
> on a task and be guaranteed that the files are released as soon as wait
> returns, it is mistaken. That is NOT the case.

Case in point, I just changed my reproducer to use aio instead of
io_uring. Here's the full script:

#!/bin/bash

DEV=/dev/nvme1n1
MNT=/data
ITER=0

while true; do
	echo loop $ITER
	sudo mount $DEV $MNT
	fio --name=test --ioengine=aio --iodepth=2 --filename=$MNT/foo \
		--size=1g --buffered=1 --overwrite=0 --numjobs=12 \
		--minimal --rw=randread --output=/dev/null &
	Y=$(($RANDOM % 3))
	X=$(($RANDOM % 10))
	VAL="$Y.$X"
	sleep $VAL
	ps -e | grep fio > /dev/null 2>&1
	while [ $? -eq 0 ]; do
		killall -9 fio > /dev/null 2>&1
		echo will wait
		wait > /dev/null 2>&1
		echo done waiting
		ps -e | grep "fio " > /dev/null 2>&1
	done
	sudo umount /data
	if [ $? -ne 0 ]; then
		break
	fi
	((ITER++))
done

And if I run that, it fails on the first umount attempt in that loop:

axboe@m1max-kvm ~> bash test2.sh
loop 0
will wait
done waiting
umount: /data: target is busy.

So yeah, this is _nothing_ new. I really don't think trying to address
this in the kernel is the right approach; it'd be a lot saner to harden
the xfstests side to deal with the umount a bit more sanely. There are
obviously tons of other ways that a mount could get pinned, which isn't
too relevant here since the bdev and mount point are basically exclusive
to the test being run. But the kill plus delayed fput case is enough to
make that point, imho.

-- 
Jens Axboe
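
FWIW, hardening the script's umount step along those lines could be as
simple as retrying for a bounded time until the delayed fputs have run
and the mount is no longer busy. A rough sketch only; the retry count,
sleep interval, and the UNMOUNTED flag name are arbitrary choices, not
anything from the thread above:

	# Drop-in replacement for the bare "sudo umount /data" in the loop.
	# Delayed fput from the killed fio tasks can briefly keep the mount
	# busy even after wait returns, so retry for a bounded time instead
	# of failing on the first EBUSY.
	UNMOUNTED=0
	for i in $(seq 1 50); do
		if sudo umount $MNT > /dev/null 2>&1; then
			UNMOUNTED=1
			break
		fi
		sleep 0.1
	done
	if [ $UNMOUNTED -eq 0 ]; then
		echo "umount $MNT still busy after retries"
		break
	fi

The same idea applies to the xfstests case: retry (or lazy-umount and
wait for superblock release) rather than assume the files are gone the
moment wait returns.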