Re: [RFC 0/2] io_uring: examine request result only after completion

Jens Axboe <axboe@xxxxxxxxx> · Tue, 29 Oct 2019 13:52:57 -0600

On 10/29/19 1:51 PM, Bijan Mottahedeh wrote:
> 
> On 10/29/19 12:46 PM, Jens Axboe wrote:
>> On 10/29/19 1:40 PM, Bijan Mottahedeh wrote:
>>> On 10/29/19 12:33 PM, Jens Axboe wrote:
>>>> On 10/29/19 1:31 PM, Bijan Mottahedeh wrote:
>>>>> On 10/29/19 12:27 PM, Jens Axboe wrote:
>>>>>> On 10/29/19 1:23 PM, Bijan Mottahedeh wrote:
>>>>>>> On 10/29/19 12:17 PM, Bijan Mottahedeh wrote:
>>>>>>>> On 10/25/19 7:21 AM, Jens Axboe wrote:
>>>>>>>>> On 10/25/19 8:18 AM, Jens Axboe wrote:
>>>>>>>>>> On 10/25/19 8:07 AM, Jens Axboe wrote:
>>>>>>>>>>> On 10/25/19 7:46 AM, Bijan Mottahedeh wrote:
>>>>>>>>>>>> On 10/24/19 3:31 PM, Jens Axboe wrote:
>>>>>>>>>>>>> On 10/24/19 1:18 PM, Bijan Mottahedeh wrote:
>>>>>>>>>>>>>> On 10/24/19 10:09 AM, Jens Axboe wrote:
>>>>>>>>>>>>>>> On 10/24/19 3:18 AM, Bijan Mottahedeh wrote:
>>>>>>>>>>>>>>>> Running an fio test consistenly crashes the kernel with the
>>>>>>>>>>>>>>>> trace included
>>>>>>>>>>>>>>>> below.  The root cause seems to be the code in
>>>>>>>>>>>>>>>> __io_submit_sqe() that
>>>>>>>>>>>>>>>> checks the result of a request for -EAGAIN in polled mode,
>>>>>>>>>>>>>>>> without
>>>>>>>>>>>>>>>> ensuring first that the request has completed:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>            if (ctx->flags & IORING_SETUP_IOPOLL) {
>>>>>>>>>>>>>>>>                if (req->result == -EAGAIN)
>>>>>>>>>>>>>>>>                    return -EAGAIN;
>>>>>>>>>>>>>>> I'm a little confused, because we should be holding the submission
>>>>>>>>>>>>>>> reference to the request still at this point. So how is it
>>>>>>>>>>>>>>> going away?
>>>>>>>>>>>>>>> I must be missing something...
>>>>>>>>>>>>>> I don't think the submission reference is going away...
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I *think* the problem has to do with the fact that
>>>>>>>>>>>>>> io_complete_rw_iopoll() which sets REQ_F_IOPOLL_COMPLETED is being
>>>>>>>>>>>>>> called from interrupt context in my configuration and so there is a
>>>>>>>>>>>>>> potential race between updating the request there and checking
>>>>>>>>>>>>>> it in
>>>>>>>>>>>>>> __io_submit_sqe().
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> My first workaround was to simply poll for
>>>>>>>>>>>>>> REQ_F_IOPOLL_COMPLETED in the
>>>>>>>>>>>>>> code snippet above:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>                  if (req->result == --EAGAIN) {
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>                      poll for REQ_F_IOPOLL_COMPLETED
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>                      return -EAGAIN;
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> }
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> and that got rid of the problem.
>>>>>>>>>>>>> But that will not work at all for a proper poll setup, where you
>>>>>>>>>>>>> don't
>>>>>>>>>>>>> trigger any IRQs... It only happens to work for this case because
>>>>>>>>>>>>> you're
>>>>>>>>>>>>> still triggering interrupts. But even in that case, it's not a real
>>>>>>>>>>>>> solution, but I don't think that's the argument here ;-)
>>>>>>>>>>>> Sure.
>>>>>>>>>>>>
>>>>>>>>>>>> I'm just curious though as how it would break the poll case because
>>>>>>>>>>>> io_complete_rw_iopoll() would still be called though through polling,
>>>>>>>>>>>> REQ_F_IOPOLL_COMPLETED would be set, and so io_iopoll_complete()
>>>>>>>>>>>> should be able to reliably check req->result.
>>>>>>>>>>> It'd break the poll case because the task doing the submission is
>>>>>>>>>>> generally also the one that finds and reaps completion. Hence if you
>>>>>>>>>>> block that task just polling on that completion bit, you are
>>>>>>>>>>> preventing
>>>>>>>>>>> that very task from going and reaping completions. The condition would
>>>>>>>>>>> never become true, and you are now looping forever.
>>>>>>>>>>>
>>>>>>>>>>>> The same poll test seemed to run ok with nvme interrupts not being
>>>>>>>>>>>> triggered. Anyway, no argument that it's not needed!
>>>>>>>>>>> A few reasons why it would make progress:
>>>>>>>>>>>
>>>>>>>>>>> - You eventually trigger a timeout on the nvme side, as blk-mq
>>>>>>>>>>> finds the
>>>>>>>>>>>             request hasn't been completed by an IRQ. But that's a 30
>>>>>>>>>>> second ordeal
>>>>>>>>>>>             before that event occurs.
>>>>>>>>>>>
>>>>>>>>>>> - There was still interrupts enabled.
>>>>>>>>>>>
>>>>>>>>>>> - You have two threads, one doing submission and one doing
>>>>>>>>>>> completions.
>>>>>>>>>>>             Maybe using SQPOLL? If that's the case, then yes, it'd still
>>>>>>>>>>> work as
>>>>>>>>>>>             you have separate threads for submission and completion.
>>>>>>>>>>>
>>>>>>>>>>> For the "generic" case of just using one thread and IRQs disabled,
>>>>>>>>>>> it'd
>>>>>>>>>>> deadlock.
>>>>>>>>>>>
>>>>>>>>>>>>> I see what the race is now, it's specific to IRQ driven polling. We
>>>>>>>>>>>>> really should just disallow that, to be honest, it doesn't make any
>>>>>>>>>>>>> sense. But let me think about if we can do a reasonable solution
>>>>>>>>>>>>> to this
>>>>>>>>>>>>> that doesn't involve adding overhead for a proper setup.
>>>>>>>>>>>> It's a nonsensical config in a way and so disallowing it would make
>>>>>>>>>>>> the most sense.
>>>>>>>>>>> Definitely. The nvme driver should not set .poll() if it doesn't have
>>>>>>>>>>> non-irq poll queues. Something like this:
>>>>>>>>>> Actually, we already disable polling if we don't have specific poll
>>>>>>>>>> queues:
>>>>>>>>>>
>>>>>>>>>>                  if (set->nr_maps > HCTX_TYPE_POLL &&
>>>>>>>>>>                      set->map[HCTX_TYPE_POLL].nr_queues)
>>>>>>>>>>                          blk_queue_flag_set(QUEUE_FLAG_POLL, q);
>>>>>>>>>>
>>>>>>>>>> Did you see any timeouts in your tests? I wonder if the use-after-free
>>>>>>>>>> triggered when the timeout found the request while you had the
>>>>>>>>>> busy-spin
>>>>>>>>>> logic we discussed previously.
>>>>>>>>> Ah, but we still have fops->iopoll() set for that case. So we just won't
>>>>>>>>> poll for it, it'll get completed by IRQ. So I do think we need to handle
>>>>>>>>> this case in io_uring. I'll get back to you.
>>>>>>>>>
>>>>>>>> I ran the same test on linux-next-20191029 in polled mode and got the
>>>>>>>> same free-after-user panic:
>>>>>>>>
>>>>>>>> - I booted with nvme.poll_queues set and verified that all queues
>>>>>>>> except default where of type poll
>>>>>>>>
>>>>>>>> - I added three assertions to verify the following:
>>>>>>>>
>>>>>>>>            - nvme_timeout() is not called
>>>>>>>>
>>>>>>>>            - io_complete_rw_iopoll() is not called from interrupt context
>>>>>>>>
>>>>>>>>            - io_sq_offload_start() is not called with IORING_SETUP_SQPOLL set
>>>>>>>>
>>>>>>>> Is it possible that the race is there also in polled mode since a
>>>>>>>> request submitted by one thread could conceivably be polled for and
>>>>>>>> completed by a different thread, e.g. in
>>>>>>>> io_uring_enter()->io_iopoll_check()?
>>>>>>>>
>>>>>>>> --bijan
>>>>>>>>
>>>>>>>>
>>>>>>> I also tested my RFC again with 1 thread and with queue depths of 1 to
>>>>>>> 1024 in multiples of 8 and didn't see any hangs.
>>>>>>>
>>>>>>> Just to be clear, the busy-spin logic discussed before was only a
>>>>>>> workaround an not in the RFC.
>>>>>> What is your exact test case?
>>>>>>
>>>>> See original cover letter.  I can reproduce the failure with numjobs
>>>>> between 8 and 32.
>>>> And how many poll queues are you using?
>>>>
>>> 30
>> And how many threads/cores in the box? Trying to get a sense for how
>> many CPUs share a single poll queue, if any.
>>
> Thread(s) per core:    2
> Core(s) per socket:    8
> Socket(s):             2

OK, so 2*8*2 == 32 threads, hence some threads will share a poll queue.

-- 
Jens Axboe