Re: [PATCH RFC 5.13 2/2] io_uring: submit sqes in the original context when waking up sqthread

On 4/29/21 9:44 AM, Hao Xu wrote:
> On 4/28/21 10:34 PM, Pavel Begunkov wrote:
>> On 4/28/21 2:32 PM, Hao Xu wrote:
>>> With SQPOLL, sqes are submitted by the sqthread, which means there
>>> is extra IO latency whenever the sqthread has to be woken up. To
>>> eliminate it, submit a limited number of sqes in the original task
>>> context.
>>> Tests result below:
>>
>> Frankly, it can be a nest of corner cases, if not now then in the
>> future, leading to a high maintenance burden. Hence, if we consider
>> the change, I'd want to limit the userspace exposure, so it can be
>> removed if needed.
>>
>> A noticeable change of behaviour here, as Hao recently asked about,
>> is that the ring can be passed to a task from a completely different
>> thread group, and so the feature would execute from that context, not
>> from the original/sqpoll one.
>>
>> Not sure the IORING_ENTER_SQ_DEPUTY knob is needed, but it can at
>> least be ignored if the previous point is addressed.
>>
>>>
>>> 99th percentile latency:
>>> iops\idle  10us  60us  110us  160us  210us  260us  310us  360us  410us  460us   510us
>>> with this patch:
>>> 2k         13    13    12     13     13     12     12     11     11     10.304  11.84
>>> without this patch:
>>> 2k         15    14    15     15     15     14     15     14     14     13      11.84
>>
>> Not sure the second nine (99th percentile) alone describes it well
>> enough; please can you add more data? Mean latency, 50%, 90%, 99%,
>> 99.9%, throughput.
>>
>> Btw, how did it happen that only some of the numbers have a
>> fractional part? I can't believe all but 3 of them were close enough
>> to integer values.
>>
>>> fio config:
>>> ./run_fio.sh
>>> fio \
>>> --ioengine=io_uring --sqthread_poll=1 --hipri=1 --thread=1 --bs=4k \
>>> --direct=1 --rw=randread --time_based=1 --runtime=300 \
>>> --group_reporting=1 --filename=/dev/nvme1n1 --sqthread_poll_cpu=30 \
>>> --randrepeat=0 --cpus_allowed=35 --iodepth=128 --rate_iops=${1} \
>>> --io_sq_thread_idle=${2}
>>>
>>> Signed-off-by: Hao Xu <haoxu@xxxxxxxxxxxxxxxxx>
>>> ---
>>>   fs/io_uring.c                 | 29 +++++++++++++++++++++++------
>>>   include/uapi/linux/io_uring.h |  1 +
>>>   2 files changed, 24 insertions(+), 6 deletions(-)
>>>
>>> diff --git a/fs/io_uring.c b/fs/io_uring.c
>>> index 1871fad48412..f0a01232671e 100644
>>> --- a/fs/io_uring.c
>>> +++ b/fs/io_uring.c
>>> @@ -1252,7 +1252,12 @@ static void io_queue_async_work(struct io_kiocb *req)
>>>   {
>>>       struct io_ring_ctx *ctx = req->ctx;
>>>       struct io_kiocb *link = io_prep_linked_timeout(req);
>>> -    struct io_uring_task *tctx = req->task->io_uring;
>>> +    struct io_uring_task *tctx = NULL;
>>> +
>>> +    if (ctx->sq_data && ctx->sq_data->thread)
>>> +        tctx = ctx->sq_data->thread->io_uring;
>>
>> Without park it's racy: sq_data->thread may become NULL and be
>> freed, as well as its ->io_uring.
> I now think that it's ok to queue async work to req->task->io_uring.
> I looked through all the OPs; it seems only async_cancel needs care:
> 
> io_async_cancel(req) {
>    cancel from req->task->io_uring;
>    cancel from ctx->tctx_list
> }
> 
> Given req->task is the 'original context', the req to be cancelled
> may be in ctx->sq_data->thread->io_uring's iowq. So searching for the
> req in sqthread->io_uring is needed here. This avoids overhead in the
> main code path.
> Did I miss something else?

It must be req->task->io_uring, otherwise cancellations will be
broken. And using it should be fine in theory, because io-wq etc.
should be set up by io_uring_add_task_file().


One more problem for the pile is io_req_task_work_add() and the notify
mode it chooses. Look for IORING_SETUP_SQPOLL in the function.

Also, with IOPOLL+SQPOLL, io_uring_try_cancel_requests() looks like it
may fail (didn't double check it). Look again for IORING_SETUP_SQPOLL.

I'd rather recommend going over all uses of IORING_SETUP_SQPOLL and
checking whether each one is flawed.

> 
> 
>>
>>> +    else
>>> +        tctx = req->task->io_uring;
>>>         BUG_ON(!tctx);
>>>       BUG_ON(!tctx->io_wq);
>>
>> [snip]
>>
> 

-- 
Pavel Begunkov


