On 10/29/19 1:51 PM, Bijan Mottahedeh wrote: > > On 10/29/19 12:46 PM, Jens Axboe wrote: >> On 10/29/19 1:40 PM, Bijan Mottahedeh wrote: >>> On 10/29/19 12:33 PM, Jens Axboe wrote: >>>> On 10/29/19 1:31 PM, Bijan Mottahedeh wrote: >>>>> On 10/29/19 12:27 PM, Jens Axboe wrote: >>>>>> On 10/29/19 1:23 PM, Bijan Mottahedeh wrote: >>>>>>> On 10/29/19 12:17 PM, Bijan Mottahedeh wrote: >>>>>>>> On 10/25/19 7:21 AM, Jens Axboe wrote: >>>>>>>>> On 10/25/19 8:18 AM, Jens Axboe wrote: >>>>>>>>>> On 10/25/19 8:07 AM, Jens Axboe wrote: >>>>>>>>>>> On 10/25/19 7:46 AM, Bijan Mottahedeh wrote: >>>>>>>>>>>> On 10/24/19 3:31 PM, Jens Axboe wrote: >>>>>>>>>>>>> On 10/24/19 1:18 PM, Bijan Mottahedeh wrote: >>>>>>>>>>>>>> On 10/24/19 10:09 AM, Jens Axboe wrote: >>>>>>>>>>>>>>> On 10/24/19 3:18 AM, Bijan Mottahedeh wrote: >>>>>>>>>>>>>>>> Running an fio test consistenly crashes the kernel with the >>>>>>>>>>>>>>>> trace included >>>>>>>>>>>>>>>> below. The root cause seems to be the code in >>>>>>>>>>>>>>>> __io_submit_sqe() that >>>>>>>>>>>>>>>> checks the result of a request for -EAGAIN in polled mode, >>>>>>>>>>>>>>>> without >>>>>>>>>>>>>>>> ensuring first that the request has completed: >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> if (ctx->flags & IORING_SETUP_IOPOLL) { >>>>>>>>>>>>>>>> if (req->result == -EAGAIN) >>>>>>>>>>>>>>>> return -EAGAIN; >>>>>>>>>>>>>>> I'm a little confused, because we should be holding the submission >>>>>>>>>>>>>>> reference to the request still at this point. So how is it >>>>>>>>>>>>>>> going away? >>>>>>>>>>>>>>> I must be missing something... >>>>>>>>>>>>>> I don't think the submission reference is going away... >>>>>>>>>>>>>> >>>>>>>>>>>>>> I *think* the problem has to do with the fact that >>>>>>>>>>>>>> io_complete_rw_iopoll() which sets REQ_F_IOPOLL_COMPLETED is being >>>>>>>>>>>>>> called from interrupt context in my configuration and so there is a >>>>>>>>>>>>>> potential race between updating the request there and checking >>>>>>>>>>>>>> it in >>>>>>>>>>>>>> __io_submit_sqe(). >>>>>>>>>>>>>> >>>>>>>>>>>>>> My first workaround was to simply poll for >>>>>>>>>>>>>> REQ_F_IOPOLL_COMPLETED in the >>>>>>>>>>>>>> code snippet above: >>>>>>>>>>>>>> >>>>>>>>>>>>>> if (req->result == --EAGAIN) { >>>>>>>>>>>>>> >>>>>>>>>>>>>> poll for REQ_F_IOPOLL_COMPLETED >>>>>>>>>>>>>> >>>>>>>>>>>>>> return -EAGAIN; >>>>>>>>>>>>>> >>>>>>>>>>>>>> } >>>>>>>>>>>>>> >>>>>>>>>>>>>> and that got rid of the problem. >>>>>>>>>>>>> But that will not work at all for a proper poll setup, where you >>>>>>>>>>>>> don't >>>>>>>>>>>>> trigger any IRQs... It only happens to work for this case because >>>>>>>>>>>>> you're >>>>>>>>>>>>> still triggering interrupts. But even in that case, it's not a real >>>>>>>>>>>>> solution, but I don't think that's the argument here ;-) >>>>>>>>>>>> Sure. >>>>>>>>>>>> >>>>>>>>>>>> I'm just curious though as how it would break the poll case because >>>>>>>>>>>> io_complete_rw_iopoll() would still be called though through polling, >>>>>>>>>>>> REQ_F_IOPOLL_COMPLETED would be set, and so io_iopoll_complete() >>>>>>>>>>>> should be able to reliably check req->result. >>>>>>>>>>> It'd break the poll case because the task doing the submission is >>>>>>>>>>> generally also the one that finds and reaps completion. Hence if you >>>>>>>>>>> block that task just polling on that completion bit, you are >>>>>>>>>>> preventing >>>>>>>>>>> that very task from going and reaping completions. The condition would >>>>>>>>>>> never become true, and you are now looping forever. >>>>>>>>>>> >>>>>>>>>>>> The same poll test seemed to run ok with nvme interrupts not being >>>>>>>>>>>> triggered. Anyway, no argument that it's not needed! >>>>>>>>>>> A few reasons why it would make progress: >>>>>>>>>>> >>>>>>>>>>> - You eventually trigger a timeout on the nvme side, as blk-mq >>>>>>>>>>> finds the >>>>>>>>>>> request hasn't been completed by an IRQ. But that's a 30 >>>>>>>>>>> second ordeal >>>>>>>>>>> before that event occurs. >>>>>>>>>>> >>>>>>>>>>> - There was still interrupts enabled. >>>>>>>>>>> >>>>>>>>>>> - You have two threads, one doing submission and one doing >>>>>>>>>>> completions. >>>>>>>>>>> Maybe using SQPOLL? If that's the case, then yes, it'd still >>>>>>>>>>> work as >>>>>>>>>>> you have separate threads for submission and completion. >>>>>>>>>>> >>>>>>>>>>> For the "generic" case of just using one thread and IRQs disabled, >>>>>>>>>>> it'd >>>>>>>>>>> deadlock. >>>>>>>>>>> >>>>>>>>>>>>> I see what the race is now, it's specific to IRQ driven polling. We >>>>>>>>>>>>> really should just disallow that, to be honest, it doesn't make any >>>>>>>>>>>>> sense. But let me think about if we can do a reasonable solution >>>>>>>>>>>>> to this >>>>>>>>>>>>> that doesn't involve adding overhead for a proper setup. >>>>>>>>>>>> It's a nonsensical config in a way and so disallowing it would make >>>>>>>>>>>> the most sense. >>>>>>>>>>> Definitely. The nvme driver should not set .poll() if it doesn't have >>>>>>>>>>> non-irq poll queues. Something like this: >>>>>>>>>> Actually, we already disable polling if we don't have specific poll >>>>>>>>>> queues: >>>>>>>>>> >>>>>>>>>> if (set->nr_maps > HCTX_TYPE_POLL && >>>>>>>>>> set->map[HCTX_TYPE_POLL].nr_queues) >>>>>>>>>> blk_queue_flag_set(QUEUE_FLAG_POLL, q); >>>>>>>>>> >>>>>>>>>> Did you see any timeouts in your tests? I wonder if the use-after-free >>>>>>>>>> triggered when the timeout found the request while you had the >>>>>>>>>> busy-spin >>>>>>>>>> logic we discussed previously. >>>>>>>>> Ah, but we still have fops->iopoll() set for that case. So we just won't >>>>>>>>> poll for it, it'll get completed by IRQ. So I do think we need to handle >>>>>>>>> this case in io_uring. I'll get back to you. >>>>>>>>> >>>>>>>> I ran the same test on linux-next-20191029 in polled mode and got the >>>>>>>> same free-after-user panic: >>>>>>>> >>>>>>>> - I booted with nvme.poll_queues set and verified that all queues >>>>>>>> except default where of type poll >>>>>>>> >>>>>>>> - I added three assertions to verify the following: >>>>>>>> >>>>>>>> - nvme_timeout() is not called >>>>>>>> >>>>>>>> - io_complete_rw_iopoll() is not called from interrupt context >>>>>>>> >>>>>>>> - io_sq_offload_start() is not called with IORING_SETUP_SQPOLL set >>>>>>>> >>>>>>>> Is it possible that the race is there also in polled mode since a >>>>>>>> request submitted by one thread could conceivably be polled for and >>>>>>>> completed by a different thread, e.g. in >>>>>>>> io_uring_enter()->io_iopoll_check()? >>>>>>>> >>>>>>>> --bijan >>>>>>>> >>>>>>>> >>>>>>> I also tested my RFC again with 1 thread and with queue depths of 1 to >>>>>>> 1024 in multiples of 8 and didn't see any hangs. >>>>>>> >>>>>>> Just to be clear, the busy-spin logic discussed before was only a >>>>>>> workaround an not in the RFC. >>>>>> What is your exact test case? >>>>>> >>>>> See original cover letter. I can reproduce the failure with numjobs >>>>> between 8 and 32. >>>> And how many poll queues are you using? >>>> >>> 30 >> And how many threads/cores in the box? Trying to get a sense for how >> many CPUs share a single poll queue, if any. >> > Thread(s) per core: 2 > Core(s) per socket: 8 > Socket(s): 2 OK, so 2*8*2 == 32 threads, hence some threads will share a poll queue. -- Jens Axboe