On 10/29/19 1:23 PM, Bijan Mottahedeh wrote: > > On 10/29/19 12:17 PM, Bijan Mottahedeh wrote: >> >> On 10/25/19 7:21 AM, Jens Axboe wrote: >>> On 10/25/19 8:18 AM, Jens Axboe wrote: >>>> On 10/25/19 8:07 AM, Jens Axboe wrote: >>>>> On 10/25/19 7:46 AM, Bijan Mottahedeh wrote: >>>>>> On 10/24/19 3:31 PM, Jens Axboe wrote: >>>>>>> On 10/24/19 1:18 PM, Bijan Mottahedeh wrote: >>>>>>>> On 10/24/19 10:09 AM, Jens Axboe wrote: >>>>>>>>> On 10/24/19 3:18 AM, Bijan Mottahedeh wrote: >>>>>>>>>> Running an fio test consistenly crashes the kernel with the >>>>>>>>>> trace included >>>>>>>>>> below. The root cause seems to be the code in >>>>>>>>>> __io_submit_sqe() that >>>>>>>>>> checks the result of a request for -EAGAIN in polled mode, >>>>>>>>>> without >>>>>>>>>> ensuring first that the request has completed: >>>>>>>>>> >>>>>>>>>> if (ctx->flags & IORING_SETUP_IOPOLL) { >>>>>>>>>> if (req->result == -EAGAIN) >>>>>>>>>> return -EAGAIN; >>>>>>>>> I'm a little confused, because we should be holding the submission >>>>>>>>> reference to the request still at this point. So how is it >>>>>>>>> going away? >>>>>>>>> I must be missing something... >>>>>>>> I don't think the submission reference is going away... >>>>>>>> >>>>>>>> I *think* the problem has to do with the fact that >>>>>>>> io_complete_rw_iopoll() which sets REQ_F_IOPOLL_COMPLETED is being >>>>>>>> called from interrupt context in my configuration and so there is a >>>>>>>> potential race between updating the request there and checking >>>>>>>> it in >>>>>>>> __io_submit_sqe(). >>>>>>>> >>>>>>>> My first workaround was to simply poll for >>>>>>>> REQ_F_IOPOLL_COMPLETED in the >>>>>>>> code snippet above: >>>>>>>> >>>>>>>> if (req->result == --EAGAIN) { >>>>>>>> >>>>>>>> poll for REQ_F_IOPOLL_COMPLETED >>>>>>>> >>>>>>>> return -EAGAIN; >>>>>>>> >>>>>>>> } >>>>>>>> >>>>>>>> and that got rid of the problem. >>>>>>> But that will not work at all for a proper poll setup, where you >>>>>>> don't >>>>>>> trigger any IRQs... It only happens to work for this case because >>>>>>> you're >>>>>>> still triggering interrupts. But even in that case, it's not a real >>>>>>> solution, but I don't think that's the argument here ;-) >>>>>> Sure. >>>>>> >>>>>> I'm just curious though as how it would break the poll case because >>>>>> io_complete_rw_iopoll() would still be called though through polling, >>>>>> REQ_F_IOPOLL_COMPLETED would be set, and so io_iopoll_complete() >>>>>> should be able to reliably check req->result. >>>>> It'd break the poll case because the task doing the submission is >>>>> generally also the one that finds and reaps completion. Hence if you >>>>> block that task just polling on that completion bit, you are >>>>> preventing >>>>> that very task from going and reaping completions. The condition would >>>>> never become true, and you are now looping forever. >>>>> >>>>>> The same poll test seemed to run ok with nvme interrupts not being >>>>>> triggered. Anyway, no argument that it's not needed! >>>>> A few reasons why it would make progress: >>>>> >>>>> - You eventually trigger a timeout on the nvme side, as blk-mq >>>>> finds the >>>>> request hasn't been completed by an IRQ. But that's a 30 >>>>> second ordeal >>>>> before that event occurs. >>>>> >>>>> - There was still interrupts enabled. >>>>> >>>>> - You have two threads, one doing submission and one doing >>>>> completions. >>>>> Maybe using SQPOLL? If that's the case, then yes, it'd still >>>>> work as >>>>> you have separate threads for submission and completion. >>>>> >>>>> For the "generic" case of just using one thread and IRQs disabled, >>>>> it'd >>>>> deadlock. >>>>> >>>>>>> I see what the race is now, it's specific to IRQ driven polling. We >>>>>>> really should just disallow that, to be honest, it doesn't make any >>>>>>> sense. But let me think about if we can do a reasonable solution >>>>>>> to this >>>>>>> that doesn't involve adding overhead for a proper setup. >>>>>> It's a nonsensical config in a way and so disallowing it would make >>>>>> the most sense. >>>>> Definitely. The nvme driver should not set .poll() if it doesn't have >>>>> non-irq poll queues. Something like this: >>>> Actually, we already disable polling if we don't have specific poll >>>> queues: >>>> >>>> if (set->nr_maps > HCTX_TYPE_POLL && >>>> set->map[HCTX_TYPE_POLL].nr_queues) >>>> blk_queue_flag_set(QUEUE_FLAG_POLL, q); >>>> >>>> Did you see any timeouts in your tests? I wonder if the use-after-free >>>> triggered when the timeout found the request while you had the >>>> busy-spin >>>> logic we discussed previously. >>> Ah, but we still have fops->iopoll() set for that case. So we just won't >>> poll for it, it'll get completed by IRQ. So I do think we need to handle >>> this case in io_uring. I'll get back to you. >>> >> >> I ran the same test on linux-next-20191029 in polled mode and got the >> same free-after-user panic: >> >> - I booted with nvme.poll_queues set and verified that all queues >> except default where of type poll >> >> - I added three assertions to verify the following: >> >> - nvme_timeout() is not called >> >> - io_complete_rw_iopoll() is not called from interrupt context >> >> - io_sq_offload_start() is not called with IORING_SETUP_SQPOLL set >> >> Is it possible that the race is there also in polled mode since a >> request submitted by one thread could conceivably be polled for and >> completed by a different thread, e.g. in >> io_uring_enter()->io_iopoll_check()? >> >> --bijan >> >> > I also tested my RFC again with 1 thread and with queue depths of 1 to > 1024 in multiples of 8 and didn't see any hangs. > > Just to be clear, the busy-spin logic discussed before was only a > workaround an not in the RFC. What is your exact test case? -- Jens Axboe