On 10/30/19 1:21 PM, Bijan Mottahedeh wrote:
>
> On 10/30/19 10:32 AM, Jens Axboe wrote:
>> On 10/30/19 8:18 AM, Jens Axboe wrote:
>>> On 10/30/19 8:02 AM, Bijan Mottahedeh wrote:
>>>>> OK, so I still don't quite see where the issue is. Your setup has
>>>>> more than one CPU per poll queue, and I can reproduce the issue
>>>>> quite easily here with a similar setup.
>>>> That's probably why I couldn't reproduce this in a vm. This time I
>>>> set up one poll queue in an 8 cpu vm and reproduced it.
>>> You definitely do need poll queue sharing to hit this.
>>>
>>>>> Below are some things that are given:
>>>>>
>>>>> 1) If we fail to submit the IO, io_complete_rw_iopoll() is
>>>>>    ultimately invoked _from_ the submission path. This means that
>>>>>    the result is readily available by the time we go and check:
>>>>>
>>>>>    if (req->result == -EAGAIN)
>>>>>
>>>>>    in __io_submit_sqe().
>>>> Is that always true?
>>>>
>>>> Let's say the operation was __io_submit_sqe()->io_read().
>>>>
>>>> By "failing to submit the io", do you mean that
>>>> io_read()->call_read_iter() returned success or failure? Are you
>>>> saying that req->result was set from kiocb_done() or later in the
>>>> block layer?
>>> By "failed to submit" I mean that req->result == -EAGAIN. We set that
>>> in io_complete_rw_iopoll(), which is called off bio_wouldblock_error()
>>> in the block layer. This is different from a ret != 0 return, which
>>> would have been some sort of other failure we encountered, failing to
>>> submit the request. For that error, we just end the request. For the
>>> bio_wouldblock_error(), we need to retry submission.
>>>
>>>> Anyway I assume that io_read() would have to return success since
>>>> otherwise __io_submit_sqe() would immediately return and not check
>>>> req->result:
>>>>
>>>>         if (ret)
>>>>                 return ret;
>>> Right, if ret != 0, then we have a fatal error for that request.
>>> ->result will not have been set at all for that case.
>>>
>>>> So if io_read() did return success, are we guaranteed that setting
>>>> req->result = -EAGAIN would always happen before the check?
>>> Always, since it happens inline from when you attempt to issue the
>>> read.
>>>
>>>> Also, is it possible that we can be preempted in __io_submit_sqe()
>>>> after the call to io_read() but before the -EAGAIN check?
>>> Sure, we're not disabling preemption. But that shouldn't have any
>>> bearing on this case.
>>>
>>>>> This is a submission time failure, not something that should be
>>>>> happening from a completion path after the IO has been submitted
>>>>> successfully.
>>>>>
>>>>> 2) If we succeed in submitting the request, given that we have
>>>>>    other tasks polling, the request can complete any time. It can
>>>>>    very well be complete by the time we call io_iopoll_req_issued(),
>>>>>    and this is perfectly fine. We know the request isn't going away,
>>>>>    as we're holding a reference to it. kiocb has the same lifetime,
>>>>>    as it's embedded in the io_kiocb request. Note that this isn't
>>>>>    the same io_ring_ctx at all, some other task with its own
>>>>>    io_ring_ctx just happens to find our request when doing polling
>>>>>    on the same queue itself.
>>>> Ah yes, it's a different io_ring_ctx, different poll list etc. For my
>>> Exactly
>>>
>>>> own clarity, I assume all contexts are mapping the same actual sq/cq
>>>> rings?
>>> The ring/context isn't mapped to a particular ring, a specific IO is
>>> mapped to a specific queue at the time it's being submitted. That
>>> depends on the IO type and where that task happens to be running at
>>> the time the IO is being submitted.
>>
>> This might just do it, except it looks like there's a bug in sbitmap
>> where we don't always flush deferred clears. I'll look into that now.
>>
>>
>> diff --git a/fs/io_uring.c b/fs/io_uring.c
>> index af1937d66aee..f3ca2ee44dbd 100644
>> --- a/fs/io_uring.c
>> +++ b/fs/io_uring.c
>> @@ -1212,6 +1212,7 @@ static int io_prep_rw(struct io_kiocb *req, const struct sqe_submit *s,
>>  
>>  		kiocb->ki_flags |= IOCB_HIPRI;
>>  		kiocb->ki_complete = io_complete_rw_iopoll;
>> +		req->result = 0;
>>  	} else {
>>  		if (kiocb->ki_flags & IOCB_HIPRI)
>>  			return -EINVAL;
>>
> I checked out a few configs and didn't hit the error. However, with 16
> poll queues and 32 fio jobs, I eventually started getting a bunch of
> nvme timeout/abort messages and the system seemed to be stuck in a
> poll loop. Killing the test didn't work and the system was hung:
>
> ...
>
> [ 407.978329] nvme nvme0: I/O 679 QID 30 timeout, aborting
> [ 408.085318] nvme nvme0: I/O 680 QID 30 timeout, aborting
> [ 408.164145] nvme nvme0: I/O 682 QID 30 timeout, aborting
> [ 408.238102] nvme nvme0: I/O 683 QID 30 timeout, aborting
> [ 412.970712] nvme nvme0: Abort status: 0x0r=239k IOPS][eta 01m:30s]
> [ 413.018696] nvme nvme0: Abort status: 0x0
> [ 413.066691] nvme nvme0: Abort status: 0x0
> [ 413.114684] nvme nvme0: Abort status: 0x0
> [ 438.035324] nvme nvme0: I/O 674 QID 30 timeout, reset controllers]
> [ 444.287637] nvme nvme0: 15/0/16 default/read/poll queues
> [ 637.073111] INFO: task kworker/u66:0:55 blocked for more than 122
> seconds.
>
> ...
>
> fio: terminating on signal 2
> Jobs: 32 (f=32): [r(32)][0.3%][eta 02d:01h:28m:36s]

Right, there's some funkiness going on on the nvme side with shared poll
queues, I'm looking into it right now. The patch I sent only takes care
of the "submit after -EAGAIN was already received" case that you
reported.

-- 
Jens Axboe
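For readers following the thread, below is a minimal, compilable userspace
sketch of the submission-side flow being debated above: a fatal submit error
returns ret != 0 and ends the request, while on the bio_wouldblock_error()
path io_complete_rw_iopoll() stores -EAGAIN in req->result inline, which
__io_submit_sqe() then checks to decide whether to retry. All *_sketch
identifiers are invented stand-ins; only the names quoted from the thread
correspond to real io_uring code, and this is not the actual fs/io_uring.c
implementation.

#include <errno.h>
#include <stdio.h>

struct req_sketch {
	long result;		/* filled in by the "completion" callback */
};

/* stand-in for io_complete_rw_iopoll(), called off bio_wouldblock_error() */
static void complete_rw_iopoll_sketch(struct req_sketch *req, long res)
{
	req->result = res;
}

/* stand-in for io_read(); pretend the poll queue was full this time */
static int io_read_sketch(struct req_sketch *req)
{
	complete_rw_iopoll_sketch(req, -EAGAIN); /* happens inline, before return */
	return 0;				 /* not a fatal submission error */
}

/* stand-in for the tail end of __io_submit_sqe() in the IOPOLL case */
static int submit_sqe_sketch(struct req_sketch *req)
{
	int ret;

	req->result = 0;	/* the initialization the patch adds in io_prep_rw() */

	ret = io_read_sketch(req);
	if (ret)
		return ret;	/* fatal error, just end the request */

	if (req->result == -EAGAIN)
		return -EAGAIN;	/* wouldblock: the caller must retry the submission */

	/*
	 * Submitted for real.  Some other task polling the same hardware
	 * queue may complete the request at any time, which is fine since
	 * a reference is held; this is where io_iopoll_req_issued() runs.
	 */
	return 0;
}

int main(void)
{
	struct req_sketch req;

	/* prints -EAGAIN (-11 on Linux) for this simulated wouldblock case */
	printf("submit returned %d\n", submit_sqe_sketch(&req));
	return 0;
}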