On 10/30/19 10:32 AM, Jens Axboe wrote:
On 10/30/19 8:18 AM, Jens Axboe wrote:
On 10/30/19 8:02 AM, Bijan Mottahedeh wrote:
OK, so I still don't quite see where the issue is. Your setup has more
than one CPU per poll queue, and I can reproduce the issue quite easily
here with a similar setup.
That's probably why I couldn't reproduce this in a VM. This time I set
up one poll queue in an 8-CPU VM and reproduced it.
You definitely do need poll queue sharing to hit this.
Below are some things that are given:
1) If we fail to submit the IO, io_complete_rw_iopoll() is ultimately
invoked _from_ the submission path. This means that the result is
readily available by the time we go and check:
if (req->result == -EAGAIN)
in __io_submit_sqe().
Is that always true?
Let's say the operation was __io_submit_sqe()->io_read()
By "failing to submit the io", do you mean that
io_read()->call_read_iter() returned success or failure? Are you saying
that req->result was set from kiocb_done() or later in the block layer?
By "failed to submit" I mean that req->result == -EAGAIN. We set that in
io_complete_rw_iopoll(), which is called off bio_wouldblock_error() in
the block layer. This is different from a ret != 0 return, which would
have been some sort of other failure we encountered, failing to submit
the request. For that error, we just end the request. For the
bio_wouldblock_error(), we need to retry submission.
Anyway I assume that io_read() would have to return success since
otherwise __io_submit_sqe() would immediately return and not check
req->result:
if (ret)
return ret;
Right, if ret != 0, then we have a fatal error for that request.
->result will not have been set at all for that case.
So if io_read() did return success, are we guaranteed that setting
req->result = -EAGAIN would always happen before the check?
Always, since it happens inline from when you attempt to issue the read.
Also, is it possible that we can be preempted in __io_submit_sqe() after
the call to io_read() but before the -EAGAIN check?
Sure, we're not disabling preemption. But that shouldn't have any
bearing on this case.
This is a submission time failure, not
something that should be happening from a completion path after the
IO has been submitted successfully.
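To make the ordering concrete, here is a condensed sketch of the
submission-side flow being described (paraphrased from the
__io_submit_sqe() of that era, with argument lists and locking elided):
the read is issued inline, so by the time req->result is inspected, any
bio_wouldblock_error() has already been recorded by
io_complete_rw_iopoll():

	ret = io_read(req, s, force_nonblock);	/* issues the IO inline */
	if (ret)
		return ret;			/* fatal error, ->result never set */

	if (ctx->flags & IORING_SETUP_IOPOLL) {
		if (req->result == -EAGAIN)	/* set by io_complete_rw_iopoll() */
			return -EAGAIN;		/* punt and retry the submission */
		io_iopoll_req_issued(req);	/* queue for polling/reaping */
	}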
2) If we succeed in submitting the request, given that we have other
tasks polling, the request can complete any time. It can very well be
complete by the time we call io_iopoll_req_issued(), and this is
perfectly fine. We know the request isn't going away, as we're
holding a reference to it. kiocb has the same lifetime, as it's
embedded in the io_kiocb request. Note that this isn't the same
io_ring_ctx at all, some other task with its own io_ring_ctx just
happens to find our request when doing polling on the same queue
itself.
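A trimmed-down view of io_iopoll_req_issued() (simplified; the real
function also tracks whether multiple files sit on the list) may help
show why an already-completed request is harmless here:

static void io_iopoll_req_issued(struct io_kiocb *req)
{
	struct io_ring_ctx *ctx = req->ctx;

	/*
	 * The request may already have completed by the time it is added
	 * here: another task sharing the same hardware poll queue can
	 * have driven the completion, and io_complete_rw_iopoll() has
	 * then run. That is fine -- we still hold a reference, and the
	 * poll loop on *this* ctx will reap the request from poll_list
	 * when it runs.
	 */
	list_add_tail(&req->list, &ctx->poll_list);
}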
Ah yes, it's a different io_ring_ctx, different poll list etc.
Exactly
For my own clarity, I assume all contexts are mapping the same actual
sq/cq rings?
The ring/context isn't mapped to a particular hardware queue; a
specific IO is mapped to a specific queue at the time it's being
submitted. That depends on the IO type and where that task happens to
be running at the time the IO is being submitted.
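Roughly, the queue selection in blk-mq at that time looked like the
below (paraphrased from blk_mq_map_queue(); the nr_maps guards and
exact field spelling may differ). The sharing happens at the hardware
queue level, not at the io_uring SQ/CQ rings, which are per context:

static struct blk_mq_hw_ctx *blk_mq_map_queue(struct request_queue *q,
					      unsigned int flags,
					      struct blk_mq_ctx *ctx)
{
	enum hctx_type type = HCTX_TYPE_DEFAULT;

	/* ctx is the per-CPU software context of the submitting task */
	if (flags & REQ_HIPRI)
		type = HCTX_TYPE_POLL;
	else if ((flags & REQ_OP_MASK) == REQ_OP_READ)
		type = HCTX_TYPE_READ;

	/*
	 * With fewer poll queues than CPUs, several CPUs (and therefore
	 * several io_uring contexts) end up sharing one poll queue.
	 */
	return ctx->hctxs[type];
}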
This might just do it, except it looks like there's a bug in sbitmap
where we don't always flush deferred clears. I'll look into that now.
diff --git a/fs/io_uring.c b/fs/io_uring.c
index af1937d66aee..f3ca2ee44dbd 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -1212,6 +1212,7 @@ static int io_prep_rw(struct io_kiocb *req, const struct sqe_submit *s,
kiocb->ki_flags |= IOCB_HIPRI;
kiocb->ki_complete = io_complete_rw_iopoll;
+ req->result = 0;
} else {
if (kiocb->ki_flags & IOCB_HIPRI)
return -EINVAL;
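Roughly, the kind of stale-result confusion the init above is guarding
against (a timeline sketch of my reading of it, not code from the
patch):

	/*
	 * attempt 1: io_read() hits bio_wouldblock_error();
	 *            io_complete_rw_iopoll() sets req->result = -EAGAIN;
	 *            __io_submit_sqe() sees -EAGAIN and punts the request.
	 *
	 * attempt 2: io_read() succeeds and the IO is now in flight, but
	 *            req->result still holds the -EAGAIN from attempt 1.
	 *            If __io_submit_sqe() checks it before the completion
	 *            (possibly reaped by another task on the shared poll
	 *            queue) overwrites it, the request is wrongly treated
	 *            as "retry again".
	 *
	 * Clearing req->result in io_prep_rw() makes every attempt start
	 * from 0, so -EAGAIN can only mean *this* attempt was rejected.
	 */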
I checked out a few configs and didn't hit the error. However, with 16
poll queues and 32 fio jobs, I eventually started getting a bunch of
nvme timeout/abort messages and the system seemed to be stuck in a poll
loop. Killing the test didn't work and the system was hung:
...
[ 407.978329] nvme nvme0: I/O 679 QID 30 timeout, aborting
[ 408.085318] nvme nvme0: I/O 680 QID 30 timeout, aborting
[ 408.164145] nvme nvme0: I/O 682 QID 30 timeout, aborting
[ 408.238102] nvme nvme0: I/O 683 QID 30 timeout, aborting
[ 412.970712] nvme nvme0: Abort status: 0x0r=239k IOPS][eta 01m:30s]
[ 413.018696] nvme nvme0: Abort status: 0x0
[ 413.066691] nvme nvme0: Abort status: 0x0
[ 413.114684] nvme nvme0: Abort status: 0x0
[ 438.035324] nvme nvme0: I/O 674 QID 30 timeout, reset controllers]
[ 444.287637] nvme nvme0: 15/0/16 default/read/poll queues
[ 637.073111] INFO: task kworker/u66:0:55 blocked for more than 122 seconds.
...
fio: terminating on signal 2
Jobs: 32 (f=32): [r(32)][0.3%][eta 02d:01h:28m:36s]
Couldn't ssh to the machine and started getting hung task timeout errors.