On 10/24/19 10:09 AM, Jens Axboe wrote:
On 10/24/19 3:18 AM, Bijan Mottahedeh wrote:
Running an fio test consistenly crashes the kernel with the trace included
below. The root cause seems to be the code in __io_submit_sqe() that
checks the result of a request for -EAGAIN in polled mode, without
ensuring first that the request has completed:
if (ctx->flags & IORING_SETUP_IOPOLL) {
if (req->result == -EAGAIN)
return -EAGAIN;
I'm a little confused, because we should be holding the submission
reference to the request still at this point. So how is it going away?
I must be missing something...
I don't think the submission reference is going away...
I *think* the problem has to do with the fact that
io_complete_rw_iopoll() which sets REQ_F_IOPOLL_COMPLETED is being
called from interrupt context in my configuration and so there is a
potential race between updating the request there and checking it in
__io_submit_sqe().
My first workaround was to simply poll for REQ_F_IOPOLL_COMPLETED in the
code snippet above:
if (req->result == --EAGAIN) {
poll for REQ_F_IOPOLL_COMPLETED
return -EAGAIN;
}
and that got rid of the problem.
The request will be immediately resubmitted in io_sq_wq_submit_work(),
potentially before the the fisrt submission has completed. This creates
a race where the original completion may set REQ_F_IOPOLL_COMPLETED in
a freed submission request, overwriting the poisoned bits, casusing the
panic below.
do {
ret = __io_submit_sqe(ctx, req, s, false);
/*
* We can get EAGAIN for polled IO even though
* we're forcing a sync submission from here,
* since we can't wait for request slots on the
* block side.
*/
if (ret != -EAGAIN)
break;
cond_resched();
} while (1);
The suggested fix is to move a submitted request to the poll list
unconditionally in polled mode. The request can then be retried if
necessary once the original submission has indeed completed.
This bug raises an issue however since REQ_F_IOPOLL_COMPLETED is set
in io_complete_rw_iopoll() from interrupt context. NVMe polled queues
however are not supposed to generate interrupts so it is not clear what
is the reason for this apparent inconsitency.
It's because you're not running with poll queues for NVMe, hence you're
throwing a lot of performance away. Load nvme with poll_queues=X (or boot
with nvme.poll_queues=X, if built in) to have a set of separate queues
for polling. These don't have IRQs enabled, and it'll work much faster
for you.
That's what I did in fact. I booted with nvme.poll_queues=36 (I figured
1 per core but I'm not sure what is a reasonable number).
I also checked that /sys/block/<nvme>/queue/io_poll = 1.
What's really odd is that the irq/sec numbers from mpstat and perf show
equivalent values with/without polling (with/without fio "hipri" option)
even though I can see from perf top that we are in fact polling in one
case. I don't if I missing a step or something is off in my config.
Thanks.
--bijan