Re: [RFC 0/2] io_uring: examine request result only after completion

On 10/24/19 1:18 PM, Bijan Mottahedeh wrote:
> 
> On 10/24/19 10:09 AM, Jens Axboe wrote:
>> On 10/24/19 3:18 AM, Bijan Mottahedeh wrote:
>>> Running an fio test consistently crashes the kernel with the trace included
>>> below.  The root cause seems to be the code in __io_submit_sqe() that
>>> checks the result of a request for -EAGAIN in polled mode, without
>>> ensuring first that the request has completed:
>>>
>>> 	if (ctx->flags & IORING_SETUP_IOPOLL) {
>>> 		if (req->result == -EAGAIN)
>>> 			return -EAGAIN;
>> I'm a little confused, because we should be holding the submission
>> reference to the request still at this point. So how is it going away?
>> I must be missing something...
> 
> I don't think the submission reference is going away...
> 
> I *think* the problem has to do with the fact that
> io_complete_rw_iopoll(), which sets REQ_F_IOPOLL_COMPLETED, is being
> called from interrupt context in my configuration, and so there is a
> potential race between updating the request there and checking it in
> __io_submit_sqe().
> 
> My first workaround was to simply poll for REQ_F_IOPOLL_COMPLETED in the
> code snippet above:
> 
>       if (req->result == -EAGAIN) {
>           /* poll for REQ_F_IOPOLL_COMPLETED */
>           return -EAGAIN;
>       }
> 
> and that got rid of the problem.

But that will not work at all for a proper poll setup, where you don't
trigger any IRQs... It only happens to work for this case because you're
still triggering interrupts. Even in that case it's not a real solution,
but I don't think that's the argument here ;-)

I see what the race is now: it's specific to IRQ driven polling. We
really should just disallow that, to be honest; it doesn't make any
sense. But let me think about whether we can do a reasonable solution
to this that doesn't involve adding overhead for a proper setup.
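
To make the window concrete, the ordering that has to hold before
req->result can be trusted is roughly the following (just a sketch of
the idea, not a proposed patch; the explicit smp_wmb()/smp_rmb() pairing
here is for illustration, not what the current code does):

	/* completion side, e.g. io_complete_rw_iopoll(), possibly from irq context */
	req->result = res;
	smp_wmb();			/* publish the result before the flag */
	req->flags |= REQ_F_IOPOLL_COMPLETED;

	/* submission/iopoll side, before trusting req->result */
	if (req->flags & REQ_F_IOPOLL_COMPLETED) {
		smp_rmb();		/* pairs with the smp_wmb() above */
		if (req->result == -EAGAIN)
			return -EAGAIN;	/* now it's safe to resubmit */
	}
	/* not completed yet: leave the request on the poll list instead */

Without that ordering, __io_submit_sqe() can look at req->result while
the irq completion is still updating the request, which is exactly the
window you're hitting.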

>>> The request will be immediately resubmitted in io_sq_wq_submit_work(),
>>> potentially before the first submission has completed.  This creates
>>> a race where the original completion may set REQ_F_IOPOLL_COMPLETED in
>>> a freed submission request, overwriting the poisoned bits, causing the
>>> panic below.
>>>
>>> 	do {
>>> 		ret = __io_submit_sqe(ctx, req, s, false);
>>> 		/*
>>> 		 * We can get EAGAIN for polled IO even though
>>> 		 * we're forcing a sync submission from here,
>>> 		 * since we can't wait for request slots on the
>>> 		 * block side.
>>> 		 */
>>> 		if (ret != -EAGAIN)
>>> 			break;
>>> 		cond_resched();
>>> 	} while (1);
>>>
>>> The suggested fix is to move a submitted request to the poll list
>>> unconditionally in polled mode.  The request can then be retried if
>>> necessary once the original submission has indeed completed.
>>>
>>> This bug raises an issue, however, since REQ_F_IOPOLL_COMPLETED is set
>>> in io_complete_rw_iopoll() from interrupt context.  NVMe polled queues
>>> are not supposed to generate interrupts, so the reason for this
>>> apparent inconsistency is not clear.
>> It's because you're not running with poll queues for NVMe, hence you're
>> throwing a lot of performance away. Load nvme with poll_queues=X (or boot
>> with nvme.poll_queues=X, if built in) to have a set of separate queues
>> for polling. These don't have IRQs enabled, and it'll work much faster
>> for you.
>>
> That's what I did in fact.  I booted with nvme.poll_queues=36 (I figured
> 1 per core, but I'm not sure what a reasonable number is).
> 
> I also checked that /sys/block/<nvme>/queue/io_poll = 1.
> 
> What's really odd is that the irq/sec numbers from mpstat and perf show
> equivalent values with/without polling (with/without the fio "hipri"
> option), even though I can see from perf top that we are in fact polling
> in one case.  I don't know if I'm missing a step or something is off in
> my config.

Doesn't sound right. io_poll just says whether polling is enabled, not
whether it's IRQ driven. Try this, assuming nvme0n1 is the device in
question:

# cd /sys/kernel/debug/block/nvme0n1
# grep . hctx*/type
hctx0/type:default
hctx1/type:read
hctx2/type:read
hctx3/type:read
hctx4/type:poll
hctx5/type:poll
hctx6/type:poll
hctx7/type:poll

This is just a vm I have, hence just the 8 queues. But this shows a
proper mapping setup - 1 default queue, 3 read, and 4 poll queues.
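
It's also worth double-checking that the poll_queues parameter actually
took effect. Assuming the parameter is exposed under sysfs as usual
(it should be, whether nvme is modular or built-in):

# cat /sys/module/nvme/parameters/poll_queues

should report the 36 you booted with.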

Is nvme a module on your box? Or is it built-in?

-- 
Jens Axboe



