On 2021/4/28 10:37 PM, Pavel Begunkov wrote:
On 4/28/21 3:34 PM, Pavel Begunkov wrote:
On 4/28/21 2:32 PM, Hao Xu wrote:
SQEs are submitted by the sqthread when SQPOLL is in use, which means there
is added IO latency from waking up the sqthread. To mitigate this, submit a
limited number of SQEs in the original task context.
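Conceptually, the intent is something like the following in io_uring_enter()
(a hedged sketch, not the patch body: io_submit_sqes() and uring_lock are as
in this era of fs/io_uring.c, but the cap constant is purely illustrative):

	/* If userspace passed the new flag, submit up to a small cap
	 * inline instead of always paying the sqthread wakeup latency.
	 * SQ_DEPUTY_MAX_SUBMIT is an illustrative value, not from the patch. */
	#define SQ_DEPUTY_MAX_SUBMIT	8

	if ((ctx->flags & IORING_SETUP_SQPOLL) &&
	    (flags & IORING_ENTER_SQ_DEPUTY)) {
		mutex_lock(&ctx->uring_lock);
		submitted = io_submit_sqes(ctx,
				min_t(unsigned, to_submit, SQ_DEPUTY_MAX_SUBMIT));
		mutex_unlock(&ctx->uring_lock);
	}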
Test results below:
Frankly, it can be a nest of corner cases if not now then in the future,
leading to a high maintenance burden. Hence, if we consider the change,
I'd rather want to limit the userspace exposure, so it can be removed
if needed.
A noticeable change of behaviour here, as Hao recently asked, is that
the ring can be passed to a task from a completely different thread group,
and so the feature would execute from that context, not from the
original/sqpoll one.
So maybe something like:
if (same_thread_group()) {
	/* submit */
}

I've thought about this case (crossing independent processes) for some time.
Pavel, could you give more hints about how it may trigger errors?
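Sketched a bit more fully, the check might sit right in front of the deputy
submission (a hedged illustration: same_thread_group() is the stock kernel
helper, but ctx->submitter_task is a hypothetical field recording the ring
creator, not something this patch adds):

	/* Only deputize if the entering task shares a thread group with the
	 * task that set the ring up; otherwise keep the old behaviour of
	 * waking the sqthread. submitter_task is hypothetical here. */
	static bool io_sq_deputy_allowed(struct io_ring_ctx *ctx)
	{
		return ctx->submitter_task &&
		       same_thread_group(current, ctx->submitter_task);
	}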
Not sure the IORING_ENTER_SQ_DEPUTY knob is needed, but it can at least be
ignored if the previous point is addressed.
I'd question whether it'd be better with the flag, or without it and with
this feature enabled by default.
As Jens said, the flag here lets users make their own decision: there may be
cases where an application wants to offload as much IO-related work as
possible to sqpoll, so that it can be dedicated to computation work, etc.
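From userspace, opting in would look roughly like this (a hedged sketch:
IORING_ENTER_SQ_DEPUTY is the flag this patch proposes, taken here from a
patched <linux/io_uring.h>; liburing has no wrapper for it, so the raw
syscall is used):

	#include <linux/io_uring.h>
	#include <sys/syscall.h>
	#include <unistd.h>

	/* Opt in to deputy submission for this enter call: if the sqthread
	 * is asleep, a limited number of SQEs get submitted inline. */
	static int enter_deputy(int ring_fd, unsigned int to_submit)
	{
		return syscall(__NR_io_uring_enter, ring_fd, to_submit,
			       0 /* min_complete */,
			       IORING_ENTER_SQ_DEPUTY, NULL, 0);
	}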
99th percentile latency:

iops\idle   10us  60us  110us  160us  210us  260us  310us  360us  410us  460us    510us
with this patch:
2k          13    13    12     13     13     12     12     11     11     10.304   11.84
without this patch:
2k          15    14    15     15     15     14     15     14     14     13       11.84
Not sure the second nine describes it well enough; can you please add more
data? Mean latency, 50%, 90%, 99%, 99.9%, throughput.
Sure, I will.
Btw, how did it happen that only some of the numbers have a fractional part?
I can't believe all but 3 of them were close enough to integer values.
This confused me a little bit too, but it is indeed what fio outputs.
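For what it's worth, the extra percentiles Pavel asked for can be requested
from fio directly (percentile_list is a stock fio option; the values below
just mirror his request):

	--percentile_list=50:90:99:99.9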
fio config:
./run_fio.sh
# ${1} = rate_iops, ${2} = io_sq_thread_idle
fio \
	--ioengine=io_uring --sqthread_poll=1 --hipri=1 --thread=1 --bs=4k \
	--direct=1 --rw=randread --time_based=1 --runtime=300 \
	--group_reporting=1 --filename=/dev/nvme1n1 --sqthread_poll_cpu=30 \
	--randrepeat=0 --cpus_allowed=35 --iodepth=128 --rate_iops=${1} \
	--io_sq_thread_idle=${2}
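Each cell of the table above would then be one invocation, e.g. (assuming
${1} is in IOPS and ${2} in microseconds, per the table header):

	./run_fio.sh 2000 10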
Signed-off-by: Hao Xu <haoxu@xxxxxxxxxxxxxxxxx>
---
fs/io_uring.c | 29 +++++++++++++++++++++++------
include/uapi/linux/io_uring.h | 1 +
2 files changed, 24 insertions(+), 6 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c
index 1871fad48412..f0a01232671e 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -1252,7 +1252,12 @@ static void io_queue_async_work(struct io_kiocb *req)
 {
 	struct io_ring_ctx *ctx = req->ctx;
 	struct io_kiocb *link = io_prep_linked_timeout(req);
-	struct io_uring_task *tctx = req->task->io_uring;
+	struct io_uring_task *tctx = NULL;
+
+	if (ctx->sq_data && ctx->sq_data->thread)
+		tctx = ctx->sq_data->thread->io_uring;
Without parking, this is racy: sq_data->thread may become NULL and the task
be freed, along with its ->io_uring.
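Something along these lines is presumably what's needed (a hedged sketch,
not a tested fix: io_sq_thread_park()/io_sq_thread_unpark() are the existing
helpers in this era of fs/io_uring.c, and the lifetime of tctx after unpark
would still need auditing):

	struct io_uring_task *tctx = NULL;

	if (ctx->sq_data) {
		io_sq_thread_park(ctx->sq_data);	/* pins sqd->thread */
		if (ctx->sq_data->thread)
			tctx = ctx->sq_data->thread->io_uring;
		io_sq_thread_unpark(ctx->sq_data);
	}
	if (!tctx)
		tctx = req->task->io_uring;	/* fall back to the submitter */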
+	else
+		tctx = req->task->io_uring;
 
 	BUG_ON(!tctx);
 	BUG_ON(!tctx->io_wq);
[snip]