On 6/28/23 4:13 PM, Kent Overstreet wrote:
> On Wed, Jun 28, 2023 at 03:17:43PM -0600, Jens Axboe wrote:
>> Case in point, just changed my reproducer to use aio instead of
>> io_uring. Here's the full script:
>>
>> #!/bin/bash
>>
>> DEV=/dev/nvme1n1
>> MNT=/data
>> ITER=0
>>
>> while true; do
>> 	echo loop $ITER
>> 	sudo mount $DEV $MNT
>> 	fio --name=test --ioengine=aio --iodepth=2 --filename=$MNT/foo \
>> 		--size=1g --buffered=1 --overwrite=0 --numjobs=12 \
>> 		--minimal --rw=randread --output=/dev/null &
>> 	Y=$(($RANDOM % 3))
>> 	X=$(($RANDOM % 10))
>> 	VAL="$Y.$X"
>> 	sleep $VAL
>> 	ps -e | grep fio > /dev/null 2>&1
>> 	while [ $? -eq 0 ]; do
>> 		killall -9 fio > /dev/null 2>&1
>> 		echo will wait
>> 		wait > /dev/null 2>&1
>> 		echo done waiting
>> 		ps -e | grep "fio " > /dev/null 2>&1
>> 	done
>> 	sudo umount /data
>> 	if [ $? -ne 0 ]; then
>> 		break
>> 	fi
>> 	((ITER++))
>> done
>>
>> and if I run that, it fails on the first umount attempt in that loop:
>>
>> axboe@m1max-kvm ~> bash test2.sh
>> loop 0
>> will wait
>> done waiting
>> umount: /data: target is busy.
>>
>> So yeah, this is _nothing_ new. I really don't think trying to address
>> this in the kernel is the right approach, it'd be a lot saner to
>> harden the xfstests side to deal with the umount a bit more sanely.
>> There are obviously tons of other ways that a mount could get pinned,
>> which isn't too relevant here since the bdev and mount point are
>> basically exclusive to the test being run. But the kill and delayed
>> fput are enough to make that case, imho.
>
> Uh, count me very much not in favor of hacking around bugs elsewhere.
>
> Al, do you know if this has been considered before? We've got fput()
> being called from aio completion, which often runs out of a workqueue
> (if not a workqueue, a bottom half of some sort - what happens then, I
> wonder) - so the effect is that it goes on the global list, not the
> task work list.
>
> Hence, kill -9'ing a process doing aio (or io_uring io, for extra
> reasons) causes umount to fail with -EBUSY.
>
> And since there's no mechanism for userspace to deal with this besides
> sleep and retry, this seems pretty gross.

But there is, as Christian outlined. I would not call it pretty or
intuitive, but you can in fact make it work just fine, and not just for
the deferred fput() case but also in the presence of other kinds of
pins. Of which there are of course many.

> I'd be willing to tackle this for aio since I know that code...

But it's not aio (or io_uring or whatever), it's simply the fact that
doing an fput() from an exiting task (for example) will end up being
done async. And hence waiting for task exits is NOT enough to ensure
that all file references have been released. Since there are a variety
of other reasons why a mount may be pinned and fail to umount, perhaps
it's worth considering that changing this behavior won't buy us that
much. Especially since it's been around for more than 10 years:

commit 4a9d4b024a3102fc083c925c242d98ac27b1c5f6
Author: Al Viro <viro@xxxxxxxxxxxxxxxxxx>
Date:   Sun Jun 24 09:56:45 2012 +0400

    switch fput to task_work_add

though that commit message goes on to read:

    We are guaranteed that __fput() will be done before we return to
    userland (or exit). Note that for fput() from a kernel thread we
    get an async behaviour; it's almost always OK, but sometimes you
    might need to have __fput() completed before you do anything else.
    There are two mechanisms for that - a general barrier
    (flush_delayed_fput()) and explicit __fput_sync(). Both should be
    used with care (as was the case for fput() from kernel threads all
    along). See comments in fs/file_table.c for details.

where that first sentence isn't true if the task is indeed exiting. I
guess you can say that it is, since an exiting task never returns to
userland, but that's splitting hairs. The commit in question doesn't
seem to handle that case either, though I assume that came in with a
later fixup. The sentence is true whenever the task_work does get
added, as task_work is always run before the task returns to userspace.

If a case were to be made that we also guarantee that the fput has been
done by the time the task returns to userspace, or exits, then we'd
probably want to move that deferred fput list to the task_struct, and
ensure that it gets run if the task exits, rather than have a global
deferred list. Currently we have:

1) If kthread or in interrupt:
   1a) add to the global fput list
2) Otherwise, task_work_add(). If that fails, goto 1a.

which would then become:

1) If kthread or in interrupt:
   1a) add to the global fput list
2) Otherwise, task_work_add(). If that fails, we know the task is
   exiting: add to a per-task defer list, to be run at a convenient
   time before the task has exited.
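For reference, the current dispatch looks roughly like this - a
paraphrase of fput() in fs/file_table.c from memory, not the verbatim
source - with a comment marking where the per-task list would slot in:

void fput(struct file *file)
{
	if (atomic_long_dec_and_test(&file->f_count)) {
		struct task_struct *task = current;

		/* 2) Normal task context: queue task_work, which is run
		 * before the task returns to userspace. */
		if (likely(!in_interrupt() && !(task->flags & PF_KTHREAD))) {
			init_task_work(&file->f_rcuhead, ____fput);
			if (!task_work_add(task, &file->f_rcuhead, TWA_RESUME))
				return;
			/* task_work_add() only fails once the task has
			 * passed exit_task_work(), ie it's exiting. Today
			 * we fall through to 1a below; the proposal would
			 * instead put the file on a per-task defer list
			 * that's guaranteed to be flushed before the task
			 * is gone. */
		}

		/* 1a) kthread or interrupt context (or an exiting task):
		 * punt to the global delayed list, flushed from a
		 * workqueue at some later point. */
		if (llist_add(&file->f_llist, &delayed_fput_list))
			schedule_delayed_work(&delayed_fput_work, 1);
	}
}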
That seems a lot saner than hacking around this in umount specifically.

-- 
Jens Axboe