Re: [RFC 0/2] io_uring: examine request result only after completion

Bijan Mottahedeh <bijan.mottahedeh@xxxxxxxxxx> · Tue, 29 Oct 2019 12:23:37 -0700

On 10/29/19 12:17 PM, Bijan Mottahedeh wrote:

On 10/25/19 7:21 AM, Jens Axboe wrote:
On 10/25/19 8:18 AM, Jens Axboe wrote:
On 10/25/19 8:07 AM, Jens Axboe wrote:
On 10/25/19 7:46 AM, Bijan Mottahedeh wrote:
On 10/24/19 3:31 PM, Jens Axboe wrote:
On 10/24/19 1:18 PM, Bijan Mottahedeh wrote:
On 10/24/19 10:09 AM, Jens Axboe wrote:
On 10/24/19 3:18 AM, Bijan Mottahedeh wrote:
Running an fio test consistenly crashes the kernel with the 
trace included
below.  The root cause seems to be the code in 
__io_submit_sqe() that
checks the result of a request for -EAGAIN in polled mode, 
without
ensuring first that the request has completed:

    if (ctx->flags & IORING_SETUP_IOPOLL) {
        if (req->result == -EAGAIN)
            return -EAGAIN;
I'm a little confused, because we should be holding the submission
reference to the request still at this point. So how is it 
going away?
I must be missing something...
I don't think the submission reference is going away...

I *think* the problem has to do with the fact that
io_complete_rw_iopoll() which sets REQ_F_IOPOLL_COMPLETED is being
called from interrupt context in my configuration and so there is a
potential race between updating the request there and checking 
it in
__io_submit_sqe().

My first workaround was to simply poll for 
REQ_F_IOPOLL_COMPLETED in the
code snippet above:

          if (req->result == --EAGAIN) {

              poll for REQ_F_IOPOLL_COMPLETED

              return -EAGAIN;

}

and that got rid of the problem.
But that will not work at all for a proper poll setup, where you 
don't
trigger any IRQs... It only happens to work for this case because 
you're
still triggering interrupts. But even in that case, it's not a real
solution, but I don't think that's the argument here ;-)
Sure.

I'm just curious though as how it would break the poll case because
io_complete_rw_iopoll() would still be called though through polling,
REQ_F_IOPOLL_COMPLETED would be set, and so io_iopoll_complete()
should be able to reliably check req->result.
It'd break the poll case because the task doing the submission is
generally also the one that finds and reaps completion. Hence if you
block that task just polling on that completion bit, you are 
preventing
that very task from going and reaping completions. The condition would
never become true, and you are now looping forever.

The same poll test seemed to run ok with nvme interrupts not being
triggered. Anyway, no argument that it's not needed!
A few reasons why it would make progress:

- You eventually trigger a timeout on the nvme side, as blk-mq 
finds the
     request hasn't been completed by an IRQ. But that's a 30 
second ordeal
     before that event occurs.

- There was still interrupts enabled.

- You have two threads, one doing submission and one doing 
completions.
     Maybe using SQPOLL? If that's the case, then yes, it'd still 
work as
     you have separate threads for submission and completion.

For the "generic" case of just using one thread and IRQs disabled, 
it'd
deadlock.

I see what the race is now, it's specific to IRQ driven polling. We
really should just disallow that, to be honest, it doesn't make any
sense. But let me think about if we can do a reasonable solution 
to this
that doesn't involve adding overhead for a proper setup.
It's a nonsensical config in a way and so disallowing it would make
the most sense.
Definitely. The nvme driver should not set .poll() if it doesn't have
non-irq poll queues. Something like this:
Actually, we already disable polling if we don't have specific poll
queues:

          if (set->nr_maps > HCTX_TYPE_POLL &&
              set->map[HCTX_TYPE_POLL].nr_queues)
                  blk_queue_flag_set(QUEUE_FLAG_POLL, q);

Did you see any timeouts in your tests? I wonder if the use-after-free
triggered when the timeout found the request while you had the 
busy-spin
logic we discussed previously.
Ah, but we still have fops->iopoll() set for that case. So we just won't
poll for it, it'll get completed by IRQ. So I do think we need to handle
this case in io_uring. I'll get back to you.

I ran the same test on linux-next-20191029 in polled mode and got the 
same free-after-user panic:

- I booted with nvme.poll_queues set and verified that all queues 
except default where of type poll

- I added three assertions to verify the following:

    - nvme_timeout() is not called

    - io_complete_rw_iopoll() is not called from interrupt context

    - io_sq_offload_start() is not called with IORING_SETUP_SQPOLL set

Is it possible that the race is there also in polled mode since a 
request submitted by one thread could conceivably be polled for and 
completed by a different thread, e.g. in 
io_uring_enter()->io_iopoll_check()?

--bijan

I also tested my RFC again with 1 thread and with queue depths of 1 to 
1024 in multiples of 8 and didn't see any hangs.

Just to be clear, the busy-spin logic discussed before was only a 
workaround an not in the RFC.

--bijan