Hi,

When the io_uring fd is closed, we ensure that any pending or future request gets canceled when it is run. But we don't wait for that to happen; it happens out-of-line in a workqueue. This means that if you kill a task that has pending IO with io_uring, then even after the process has exited, there's still a window of time before all file references have gone away. This makes a test case like:

#!/bin/bash

DEV=/dev/nvme0n1
MNT=/data
ITER=0

while true; do
	echo loop $ITER
	sudo mount $DEV $MNT
	fio --name=test --ioengine=io_uring --iodepth=2 --filename=$MNT/foo \
		--size=1g --buffered=1 --overwrite=0 --numjobs=12 --minimal \
		--rw=randread --thread=1 --output=/dev/null &
	Y=$(($RANDOM % 3))
	X=$(($RANDOM % 10))
	VAL="$Y.$X"
	sleep $VAL
	ps -e | grep fio > /dev/null 2>&1
	while [ $? -eq 0 ]; do
		killall -9 fio > /dev/null 2>&1
		wait > /dev/null 2>&1
		ps -e | grep "fio " > /dev/null 2>&1
	done
	sudo umount /data
	if [ $? -ne 0 ]; then
		break
	fi
	((ITER++))
done

fail with -EBUSY for the umount, even though the task that had the files open is gone and reaped.

This patchset attempts to rectify that.

Patch 1 switches us to a simpler private percpu reference count, which means we don't need to sync RCU when exiting. An RCU grace period can take a long time, and the exiting task would now be waiting for that when closing the ring fd.

Patch 2 tweaks when we consider requests cancelable, moving away from using PF_EXITING and just looking at the ring ref state instead.

Patch 3 finally does the trivial "wait for cancelations to happen before considering the fd closed" trick, which fixes the above test case.

--
Jens Axboe
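
Conceptually, patch 3 boils down to "mark the ring dying, refuse new work, then block in the release path until every in-flight reference has been dropped". The sketch below is a standalone userspace illustration of that idea only; it is not the kernel code, and all names in it are made up. A plain count plus a condition variable stands in for the ring's reference count, and ctx_get() refusing work once the context is dying loosely mirrors patch 2's "look at the ring ref state instead of PF_EXITING". Build with something like `cc -pthread drain_sketch.c`.

/* drain_sketch.c - illustrative only, not the io_uring implementation. */
#include <pthread.h>
#include <stdbool.h>
#include <stdio.h>
#include <unistd.h>

struct ctx {
	pthread_mutex_t lock;
	pthread_cond_t  idle;   /* signaled when refs drops to zero       */
	int             refs;   /* in-flight requests holding the ctx     */
	bool            dying;  /* set on release; new work is refused    */
};

static bool ctx_get(struct ctx *c)
{
	bool ok;

	pthread_mutex_lock(&c->lock);
	ok = !c->dying;         /* refuse new work once release has begun */
	if (ok)
		c->refs++;
	pthread_mutex_unlock(&c->lock);
	return ok;
}

static void ctx_put(struct ctx *c)
{
	pthread_mutex_lock(&c->lock);
	if (--c->refs == 0 && c->dying)
		pthread_cond_signal(&c->idle);
	pthread_mutex_unlock(&c->lock);
}

/* A pending request being canceled/completed asynchronously. */
static void *worker(void *arg)
{
	struct ctx *c = arg;

	usleep(100 * 1000);     /* pretend the cancelation takes a while */
	ctx_put(c);
	return NULL;
}

/* The "fd release": don't return until every reference is gone. */
static void ctx_release_and_wait(struct ctx *c)
{
	pthread_mutex_lock(&c->lock);
	c->dying = true;
	while (c->refs != 0)
		pthread_cond_wait(&c->idle, &c->lock);
	pthread_mutex_unlock(&c->lock);
}

int main(void)
{
	struct ctx c = {
		.lock = PTHREAD_MUTEX_INITIALIZER,
		.idle = PTHREAD_COND_INITIALIZER,
	};
	pthread_t t[4];
	int i;

	for (i = 0; i < 4; i++) {
		if (ctx_get(&c))
			pthread_create(&t[i], NULL, worker, &c);
	}
	ctx_release_and_wait(&c); /* returns only once all work is drained */
	printf("all references dropped, fd can be considered closed\n");
	for (i = 0; i < 4; i++)
		pthread_join(t[i], NULL);
	return 0;
}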