Re: [PATCH] io_uring: use task_work for links if possible

Pavel Begunkov <asml.silence@xxxxxxxxx> · Sat, 27 Jun 2020 00:20:18 +0300

On 26/06/2020 23:43, Jens Axboe wrote:
> On 6/26/20 2:27 PM, Pavel Begunkov wrote:
>> On 25/06/2020 21:27, Jens Axboe wrote:
>>> Currently links are always done in an async fashion, unless we
>>> catch them inline after we successfully complete a request without
>>> having to resort to blocking. This isn't necessarily the most efficient
>>> approach, it'd be more ideal if we could just use the task_work handling
>>> for this.
>>>
>>> Outside of saving an async jump, we can also do less prep work for
>>> these kinds of requests.
>>>
>>> Running dependent links from the task_work handler yields some nice
>>> performance benefits. As an example, examples/link-cp from the liburing
>>> repository uses read+write links to implement a copy operation. Without
>>> this patch, the a cache fold 4G file read from a VM runs in about
>>> 3 seconds:
>>
>> A few comments below
>>
>>
>>>> diff --git a/fs/io_uring.c b/fs/io_uring.c
>>> index 0bba12e4e559..389274a078c8 100644
>>> --- a/fs/io_uring.c
>>> +++ b/fs/io_uring.c
>>> @@ -898,6 +898,7 @@ enum io_mem_account {
>> ...
>>> +static void __io_req_task_submit(struct io_kiocb *req)
>>> +{
>>> +	struct io_ring_ctx *ctx = req->ctx;
>>> +
>>> +	__set_current_state(TASK_RUNNING);
>>> +	if (!io_sq_thread_acquire_mm(ctx, req)) {
>>> +		mutex_lock(&ctx->uring_lock);
>>> +		__io_queue_sqe(req, NULL, NULL);
>>> +		mutex_unlock(&ctx->uring_lock);
>>> +	} else {
>>> +		__io_req_task_cancel(req, -EFAULT);
>>> +	}
>>> +}
>>> +
>>> +static void io_req_task_submit(struct callback_head *cb)
>>> +{
>>> +	struct io_kiocb *req = container_of(cb, struct io_kiocb, task_work);
>>> +
>>> +	__io_req_task_submit(req);
>>> +}
>>> +
>>> +static void io_req_task_queue(struct io_kiocb *req)
>>> +{
>>> +	struct task_struct *tsk = req->task;
>>> +	int ret;
>>> +
>>> +	init_task_work(&req->task_work, io_req_task_submit);
>>> +
>>> +	ret = task_work_add(tsk, &req->task_work, true);
>>> +	if (unlikely(ret)) {
>>> +		init_task_work(&req->task_work, io_req_task_cancel);
>>
>> Why not to kill it off here? It just was nxt, so shouldn't anything like
>> NOCANCEL
> 
> We don't necessarily know the context, and we'd at least need to check
> REQ_F_COMP_LOCKED and propagate. I think it gets ugly really quick,
> better to just punt it to clean context.

Makes sense

> 
>>> +		tsk = io_wq_get_task(req->ctx->io_wq);
>>> +		task_work_add(tsk, &req->task_work, true);
>>> +	}
>>> +	wake_up_process(tsk);
>>> +}
>>> +
>>>  static void io_free_req(struct io_kiocb *req)
>>>  {
>>>  	struct io_kiocb *nxt = NULL;
>>> @@ -1671,8 +1758,12 @@ static void io_free_req(struct io_kiocb *req)
>>>  	io_req_find_next(req, &nxt);
>>>  	__io_free_req(req);
>>>  
>>> -	if (nxt)
>>> -		io_queue_async_work(nxt);
>>> +	if (nxt) {
>>> +		if (nxt->flags & REQ_F_WORK_INITIALIZED)
>>> +			io_queue_async_work(nxt);
>>
>> Don't think it will work. E.g. io_close_prep() may have set
>> REQ_F_WORK_INITIALIZED but without io_req_work_grab_env().
> 
> This really doesn't change the existing path, it just makes sure we
> don't do io_req_task_queue() on something that has already modified
> ->work (and hence, ->task_work). This might miss cases where we have
> only cleared it and done nothing else, but that just means we'll have
> cases that we could potentially improve the effiency of down the line.

Before the patch it was always initialising linked reqs, and that would
work ok, if not this lazy grab_env().

E.g. req1 -> close_req

It calls, io_req_defer_prep(__close_req__, sqe, __false__)
which doesn't do grab_env() because of for_async=false,
but calls io_close_prep() which sets REQ_F_WORK_INITIALIZED.

Then, after completion of req1 it will follow added lines

if (nxt)
	if (nxt->flags & REQ_F_WORK_INITIALIZED)
		io_queue_async_work(nxt);

Ending up in

io_queue_async_work()
	-> grab_env()

And that's who knows from which context.
E.g. req1 was an rw completed in an irq.

Not sure it's related, but fallocate shows the log below, and some other tests
hang the kernel as well.

[   42.445719] BUG: kernel NULL pointer dereference, address: 0000000000000000
[   42.445723] #PF: supervisor write access in kernel mode
[   42.445725] #PF: error_code(0x0002) - not-present page
[   42.445726] PGD 0 P4D 0 
[   42.445729] Oops: 0002 [#1] PREEMPT SMP PTI
[   42.445733] CPU: 1 PID: 1511 Comm: io_wqe_worker-0 Tainted: G          I       5.8.0-rc2-00358-g9d2391dd5359 #489
[   42.445740] RIP: 0010:override_creds+0x19/0x30
...
[   42.445754] Call Trace:
[   42.445758]  io_worker_handle_work+0x25c/0x430
[   42.445760]  io_wqe_worker+0x2a0/0x350
[   42.445764]  ? _raw_spin_unlock_irqrestore+0x24/0x40
[   42.445766]  ? io_worker_handle_work+0x430/0x430
[   42.445769]  kthread+0x136/0x180
[   42.445771]  ? kthread_park+0x90/0x90
[   42.445774]  ret_from_fork+0x22/0x30

> 
>>> -static inline void req_set_fail_links(struct io_kiocb *req)
>>> -{
>>> -	if ((req->flags & (REQ_F_LINK | REQ_F_HARDLINK)) == REQ_F_LINK)
>>> -		req->flags |= REQ_F_FAIL_LINK;
>>> -}
>>> -
>>
>> I think it'd be nicer in 2 patches, first moving io_sq_thread_drop_mm, etc. up.
>> And the second one doing actual work. 
> 
> Yeah I agree...
> 
>>>  static void io_complete_rw_common(struct kiocb *kiocb, long res,
>>>  				  struct io_comp_state *cs)
>>>  {
>>> @@ -2035,35 +2120,6 @@ static void io_complete_rw_common(struct kiocb *kiocb, long res,
>>>  	__io_req_complete(req, res, cflags, cs);
>>>  }
>>>  
>> ...
>>>  	switch (req->opcode) {
>>>  	case IORING_OP_NOP:
>>> @@ -5347,7 +5382,7 @@ static int io_req_defer(struct io_kiocb *req, const struct io_uring_sqe *sqe)
>>>  	if (!req->io) {
>>>  		if (io_alloc_async_ctx(req))
>>>  			return -EAGAIN;
>>> -		ret = io_req_defer_prep(req, sqe);
>>> +		ret = io_req_defer_prep(req, sqe, true);
>>
>> Why head of a link is for_async?
> 
> True, that could be false instead.
> 
> Since these are just minor things, we can do a fix on top. I don't want
> to reshuffle this unless I have to.

Agree, I have a pile on top myself.

-- 
Pavel Begunkov