Re: [RFC 0/2] io_uring: examine request result only after completion

Jens Axboe <axboe@xxxxxxxxx> · Tue, 29 Oct 2019 19:02:23 -0600

On 10/29/19 1:52 PM, Jens Axboe wrote:
> On 10/29/19 1:51 PM, Bijan Mottahedeh wrote:
>>
>> On 10/29/19 12:46 PM, Jens Axboe wrote:
>>> On 10/29/19 1:40 PM, Bijan Mottahedeh wrote:
>>>> On 10/29/19 12:33 PM, Jens Axboe wrote:
>>>>> On 10/29/19 1:31 PM, Bijan Mottahedeh wrote:
>>>>>> On 10/29/19 12:27 PM, Jens Axboe wrote:
>>>>>>> On 10/29/19 1:23 PM, Bijan Mottahedeh wrote:
>>>>>>>> On 10/29/19 12:17 PM, Bijan Mottahedeh wrote:
>>>>>>>>> On 10/25/19 7:21 AM, Jens Axboe wrote:
>>>>>>>>>> On 10/25/19 8:18 AM, Jens Axboe wrote:
>>>>>>>>>>> On 10/25/19 8:07 AM, Jens Axboe wrote:
>>>>>>>>>>>> On 10/25/19 7:46 AM, Bijan Mottahedeh wrote:
>>>>>>>>>>>>> On 10/24/19 3:31 PM, Jens Axboe wrote:
>>>>>>>>>>>>>> On 10/24/19 1:18 PM, Bijan Mottahedeh wrote:
>>>>>>>>>>>>>>> On 10/24/19 10:09 AM, Jens Axboe wrote:
>>>>>>>>>>>>>>>> On 10/24/19 3:18 AM, Bijan Mottahedeh wrote:
>>>>>>>>>>>>>>>>> Running an fio test consistenly crashes the kernel with the
>>>>>>>>>>>>>>>>> trace included
>>>>>>>>>>>>>>>>> below.  The root cause seems to be the code in
>>>>>>>>>>>>>>>>> __io_submit_sqe() that
>>>>>>>>>>>>>>>>> checks the result of a request for -EAGAIN in polled mode,
>>>>>>>>>>>>>>>>> without
>>>>>>>>>>>>>>>>> ensuring first that the request has completed:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>             if (ctx->flags & IORING_SETUP_IOPOLL) {
>>>>>>>>>>>>>>>>>                 if (req->result == -EAGAIN)
>>>>>>>>>>>>>>>>>                     return -EAGAIN;
>>>>>>>>>>>>>>>> I'm a little confused, because we should be holding the submission
>>>>>>>>>>>>>>>> reference to the request still at this point. So how is it
>>>>>>>>>>>>>>>> going away?
>>>>>>>>>>>>>>>> I must be missing something...
>>>>>>>>>>>>>>> I don't think the submission reference is going away...
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I *think* the problem has to do with the fact that
>>>>>>>>>>>>>>> io_complete_rw_iopoll() which sets REQ_F_IOPOLL_COMPLETED is being
>>>>>>>>>>>>>>> called from interrupt context in my configuration and so there is a
>>>>>>>>>>>>>>> potential race between updating the request there and checking
>>>>>>>>>>>>>>> it in
>>>>>>>>>>>>>>> __io_submit_sqe().
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> My first workaround was to simply poll for
>>>>>>>>>>>>>>> REQ_F_IOPOLL_COMPLETED in the
>>>>>>>>>>>>>>> code snippet above:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>                   if (req->result == --EAGAIN) {
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>                       poll for REQ_F_IOPOLL_COMPLETED
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>                       return -EAGAIN;
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> }
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> and that got rid of the problem.
>>>>>>>>>>>>>> But that will not work at all for a proper poll setup, where you
>>>>>>>>>>>>>> don't
>>>>>>>>>>>>>> trigger any IRQs... It only happens to work for this case because
>>>>>>>>>>>>>> you're
>>>>>>>>>>>>>> still triggering interrupts. But even in that case, it's not a real
>>>>>>>>>>>>>> solution, but I don't think that's the argument here ;-)
>>>>>>>>>>>>> Sure.
>>>>>>>>>>>>>
>>>>>>>>>>>>> I'm just curious though as how it would break the poll case because
>>>>>>>>>>>>> io_complete_rw_iopoll() would still be called though through polling,
>>>>>>>>>>>>> REQ_F_IOPOLL_COMPLETED would be set, and so io_iopoll_complete()
>>>>>>>>>>>>> should be able to reliably check req->result.
>>>>>>>>>>>> It'd break the poll case because the task doing the submission is
>>>>>>>>>>>> generally also the one that finds and reaps completion. Hence if you
>>>>>>>>>>>> block that task just polling on that completion bit, you are
>>>>>>>>>>>> preventing
>>>>>>>>>>>> that very task from going and reaping completions. The condition would
>>>>>>>>>>>> never become true, and you are now looping forever.
>>>>>>>>>>>>
>>>>>>>>>>>>> The same poll test seemed to run ok with nvme interrupts not being
>>>>>>>>>>>>> triggered. Anyway, no argument that it's not needed!
>>>>>>>>>>>> A few reasons why it would make progress:
>>>>>>>>>>>>
>>>>>>>>>>>> - You eventually trigger a timeout on the nvme side, as blk-mq
>>>>>>>>>>>> finds the
>>>>>>>>>>>>              request hasn't been completed by an IRQ. But that's a 30
>>>>>>>>>>>> second ordeal
>>>>>>>>>>>>              before that event occurs.
>>>>>>>>>>>>
>>>>>>>>>>>> - There was still interrupts enabled.
>>>>>>>>>>>>
>>>>>>>>>>>> - You have two threads, one doing submission and one doing
>>>>>>>>>>>> completions.
>>>>>>>>>>>>              Maybe using SQPOLL? If that's the case, then yes, it'd still
>>>>>>>>>>>> work as
>>>>>>>>>>>>              you have separate threads for submission and completion.
>>>>>>>>>>>>
>>>>>>>>>>>> For the "generic" case of just using one thread and IRQs disabled,
>>>>>>>>>>>> it'd
>>>>>>>>>>>> deadlock.
>>>>>>>>>>>>
>>>>>>>>>>>>>> I see what the race is now, it's specific to IRQ driven polling. We
>>>>>>>>>>>>>> really should just disallow that, to be honest, it doesn't make any
>>>>>>>>>>>>>> sense. But let me think about if we can do a reasonable solution
>>>>>>>>>>>>>> to this
>>>>>>>>>>>>>> that doesn't involve adding overhead for a proper setup.
>>>>>>>>>>>>> It's a nonsensical config in a way and so disallowing it would make
>>>>>>>>>>>>> the most sense.
>>>>>>>>>>>> Definitely. The nvme driver should not set .poll() if it doesn't have
>>>>>>>>>>>> non-irq poll queues. Something like this:
>>>>>>>>>>> Actually, we already disable polling if we don't have specific poll
>>>>>>>>>>> queues:
>>>>>>>>>>>
>>>>>>>>>>>                   if (set->nr_maps > HCTX_TYPE_POLL &&
>>>>>>>>>>>                       set->map[HCTX_TYPE_POLL].nr_queues)
>>>>>>>>>>>                           blk_queue_flag_set(QUEUE_FLAG_POLL, q);
>>>>>>>>>>>
>>>>>>>>>>> Did you see any timeouts in your tests? I wonder if the use-after-free
>>>>>>>>>>> triggered when the timeout found the request while you had the
>>>>>>>>>>> busy-spin
>>>>>>>>>>> logic we discussed previously.
>>>>>>>>>> Ah, but we still have fops->iopoll() set for that case. So we just won't
>>>>>>>>>> poll for it, it'll get completed by IRQ. So I do think we need to handle
>>>>>>>>>> this case in io_uring. I'll get back to you.
>>>>>>>>>>
>>>>>>>>> I ran the same test on linux-next-20191029 in polled mode and got the
>>>>>>>>> same free-after-user panic:
>>>>>>>>>
>>>>>>>>> - I booted with nvme.poll_queues set and verified that all queues
>>>>>>>>> except default where of type poll
>>>>>>>>>
>>>>>>>>> - I added three assertions to verify the following:
>>>>>>>>>
>>>>>>>>>             - nvme_timeout() is not called
>>>>>>>>>
>>>>>>>>>             - io_complete_rw_iopoll() is not called from interrupt context
>>>>>>>>>
>>>>>>>>>             - io_sq_offload_start() is not called with IORING_SETUP_SQPOLL set
>>>>>>>>>
>>>>>>>>> Is it possible that the race is there also in polled mode since a
>>>>>>>>> request submitted by one thread could conceivably be polled for and
>>>>>>>>> completed by a different thread, e.g. in
>>>>>>>>> io_uring_enter()->io_iopoll_check()?
>>>>>>>>>
>>>>>>>>> --bijan
>>>>>>>>>
>>>>>>>>>
>>>>>>>> I also tested my RFC again with 1 thread and with queue depths of 1 to
>>>>>>>> 1024 in multiples of 8 and didn't see any hangs.
>>>>>>>>
>>>>>>>> Just to be clear, the busy-spin logic discussed before was only a
>>>>>>>> workaround an not in the RFC.
>>>>>>> What is your exact test case?
>>>>>>>
>>>>>> See original cover letter.  I can reproduce the failure with numjobs
>>>>>> between 8 and 32.
>>>>> And how many poll queues are you using?
>>>>>
>>>> 30
>>> And how many threads/cores in the box? Trying to get a sense for how
>>> many CPUs share a single poll queue, if any.
>>>
>> Thread(s) per core:    2
>> Core(s) per socket:    8
>> Socket(s):             2
> 
> OK, so 2*8*2 == 32 threads, hence some threads will share a poll queue.

OK, so I still don't quite see where the issue is. Your setup has more
than one CPU per poll queue, and I can reproduce the issue quite easily
here with a similar setup. Below are some things that are given:

1) If we fail to submit the IO, io_complete_rw_iopoll() is ultimately
   invoked _from_ the submission path. This means that the result is
   readily available by the time we go and check:

   if (req->result == -EAGAIN)

   in __io_submit_sqe(). This is a submission time failure, not
   something that should be happening from a completion path after the
   IO has been submitted successfully.

2) If the succeed in submitting the request, given that we have other
   tasks polling, the request can complete any time. It can very well be
   complete by the time we call io_iopoll_req_issued(), and this is
   perfectly fine. We know the request isn't going away, as we're
   holding a reference to it. kiocb has the same lifetime, as it's
   embedded in the io_kiocb request. Note that this isn't the same
   io_ring_ctx at all, some other task with its own io_ring_ctx just
   happens to find our request when doing polling on the same queue
   itself.

We would definitely get in trouble if we submitted the request
successfully, but returned -EAGAIN because we thought we didn't.

In my testing, what I seem to see is double completions on the block
layer side, and double issues. I can't quite get that to match up with
anything...

I'll keep digging, hopefully I'll get some deeper understanding of what
exactly the issue is shortly. I was hoping I'd get that by writing my
thoughts in this email, but alas that didn't happen yet.

-- 
Jens Axboe