Re: [RFC 0/2] io_uring: examine request result only after completion

Jens Axboe <axboe@xxxxxxxxx> · Fri, 25 Oct 2019 08:18:01 -0600

On 10/25/19 8:07 AM, Jens Axboe wrote:
> On 10/25/19 7:46 AM, Bijan Mottahedeh wrote:
>>
>> On 10/24/19 3:31 PM, Jens Axboe wrote:
>>> On 10/24/19 1:18 PM, Bijan Mottahedeh wrote:
>>>> On 10/24/19 10:09 AM, Jens Axboe wrote:
>>>>> On 10/24/19 3:18 AM, Bijan Mottahedeh wrote:
>>>>>> Running an fio test consistenly crashes the kernel with the trace included
>>>>>> below.  The root cause seems to be the code in __io_submit_sqe() that
>>>>>> checks the result of a request for -EAGAIN in polled mode, without
>>>>>> ensuring first that the request has completed:
>>>>>>
>>>>>> 	if (ctx->flags & IORING_SETUP_IOPOLL) {
>>>>>> 		if (req->result == -EAGAIN)
>>>>>> 			return -EAGAIN;
>>>>> I'm a little confused, because we should be holding the submission
>>>>> reference to the request still at this point. So how is it going away?
>>>>> I must be missing something...
>>>> I don't think the submission reference is going away...
>>>>
>>>> I *think* the problem has to do with the fact that
>>>> io_complete_rw_iopoll() which sets REQ_F_IOPOLL_COMPLETED is being
>>>> called from interrupt context in my configuration and so there is a
>>>> potential race between updating the request there and checking it in
>>>> __io_submit_sqe().
>>>>
>>>> My first workaround was to simply poll for REQ_F_IOPOLL_COMPLETED in the
>>>> code snippet above:
>>>>
>>>>         if (req->result == --EAGAIN) {
>>>>
>>>>             poll for REQ_F_IOPOLL_COMPLETED
>>>>
>>>>             return -EAGAIN;
>>>>
>>>> }
>>>>
>>>> and that got rid of the problem.
>>> But that will not work at all for a proper poll setup, where you don't
>>> trigger any IRQs... It only happens to work for this case because you're
>>> still triggering interrupts. But even in that case, it's not a real
>>> solution, but I don't think that's the argument here ;-)
>>
>> Sure.
>>
>> I'm just curious though as how it would break the poll case because
>> io_complete_rw_iopoll() would still be called though through polling,
>> REQ_F_IOPOLL_COMPLETED would be set, and so io_iopoll_complete()
>> should be able to reliably check req->result.
> 
> It'd break the poll case because the task doing the submission is
> generally also the one that finds and reaps completion. Hence if you
> block that task just polling on that completion bit, you are preventing
> that very task from going and reaping completions. The condition would
> never become true, and you are now looping forever.
> 
>> The same poll test seemed to run ok with nvme interrupts not being
>> triggered. Anyway, no argument that it's not needed!
> 
> A few reasons why it would make progress:
> 
> - You eventually trigger a timeout on the nvme side, as blk-mq finds the
>    request hasn't been completed by an IRQ. But that's a 30 second ordeal
>    before that event occurs.
> 
> - There was still interrupts enabled.
> 
> - You have two threads, one doing submission and one doing completions.
>    Maybe using SQPOLL? If that's the case, then yes, it'd still work as
>    you have separate threads for submission and completion.
> 
> For the "generic" case of just using one thread and IRQs disabled, it'd
> deadlock.
> 
>>> I see what the race is now, it's specific to IRQ driven polling. We
>>> really should just disallow that, to be honest, it doesn't make any
>>> sense. But let me think about if we can do a reasonable solution to this
>>> that doesn't involve adding overhead for a proper setup.
>>
>> It's a nonsensical config in a way and so disallowing it would make
>> the most sense.
> 
> Definitely. The nvme driver should not set .poll() if it doesn't have
> non-irq poll queues. Something like this:

Actually, we already disable polling if we don't have specific poll
queues:

        if (set->nr_maps > HCTX_TYPE_POLL &&
            set->map[HCTX_TYPE_POLL].nr_queues)
                blk_queue_flag_set(QUEUE_FLAG_POLL, q);

Did you see any timeouts in your tests? I wonder if the use-after-free
triggered when the timeout found the request while you had the busy-spin
logic we discussed previously.

-- 
Jens Axboe