On 10/30/19 1:21 PM, Bijan Mottahedeh wrote:
>
> On 10/30/19 10:32 AM, Jens Axboe wrote:
>> On 10/30/19 8:18 AM, Jens Axboe wrote:
>>> On 10/30/19 8:02 AM, Bijan Mottahedeh wrote:
>>>>> OK, so I still don't quite see where the issue is. Your setup has
>>>>> more than one CPU per poll queue, and I can reproduce the issue
>>>>> quite easily here with a similar setup.
>>>> That's probably why I couldn't reproduce this in a vm. This time I
>>>> set up one poll queue in an 8 cpu vm and reproduced it.
>>> You definitely do need poll queue sharing to hit this.
>>>
>>>>> Below are some things that are given:
>>>>>
>>>>> 1) If we fail to submit the IO, io_complete_rw_iopoll() is
>>>>>    ultimately invoked _from_ the submission path. This means that
>>>>>    the result is readily available by the time we go and check:
>>>>>
>>>>>    if (req->result == -EAGAIN)
>>>>>
>>>>>    in __io_submit_sqe().
>>>> Is that always true?
>>>>
>>>> Let's say the operation was __io_submit_sqe()->io_read().
>>>>
>>>> By "failing to submit the io", do you mean that
>>>> io_read()->call_read_iter() returned success or failure? Are you
>>>> saying that req->result was set from kiocb_done() or later in the
>>>> block layer?
>>> By "failed to submit" I mean that req->result == -EAGAIN. We set that
>>> in io_complete_rw_iopoll(), which is called off bio_wouldblock_error()
>>> in the block layer. This is different from a ret != 0 return, which
>>> would have been some sort of other failure we encountered, failing to
>>> submit the request. For that error, we just end the request. For the
>>> bio_wouldblock_error(), we need to retry submission.
>>>
>>>> Anyway I assume that io_read() would have to return success since
>>>> otherwise __io_submit_sqe() would immediately return and not check
>>>> req->result:
>>>>
>>>>         if (ret)
>>>>                 return ret;
>>> Right, if ret != 0, then we have a fatal error for that request.
>>> ->result will not have been set at all for that case.
>>>
>>>> So if io_read() did return success, are we guaranteed that setting
>>>> req->result = -EAGAIN would always happen before the check?
>>> Always, since it happens inline from when you attempt to issue the
>>> read.
>>>
>>>> Also, is it possible that we can be preempted in __io_submit_sqe()
>>>> after the call to io_read() but before the -EAGAIN check?
>>> Sure, we're not disabling preemption. But that shouldn't have any
>>> bearing on this case.
>>>
>>>>> This is a submission time failure, not something that should be
>>>>> happening from a completion path after the IO has been submitted
>>>>> successfully.
>>>>>
>>>>> 2) If we succeed in submitting the request, given that we have
>>>>>    other tasks polling, the request can complete any time. It can
>>>>>    very well be complete by the time we call io_iopoll_req_issued(),
>>>>>    and this is perfectly fine. We know the request isn't going away,
>>>>>    as we're holding a reference to it. kiocb has the same lifetime,
>>>>>    as it's embedded in the io_kiocb request. Note that this isn't
>>>>>    the same io_ring_ctx at all, some other task with its own
>>>>>    io_ring_ctx just happens to find our request when doing polling
>>>>>    on the same queue itself.
>>>> Ah yes, it's a different io_ring_ctx, different poll list etc. For my
>>> Exactly
>>>
>>>> own clarity, I assume all contexts are mapping the same actual sq/cq
>>>> rings?
>>> The ring/context isn't mapped to a particular ring, a specific IO is
>>> mapped to a specific queue at the time it's being submitted. That
>>> depends on the IO type and where that task happens to be running at
>>> the time the IO is being submitted.
>>
>> This might just do it, except it looks like there's a bug in sbitmap
>> where we don't always flush deferred clears. I'll look into that now.
>>
>>
>> diff --git a/fs/io_uring.c b/fs/io_uring.c
>> index af1937d66aee..f3ca2ee44dbd 100644
>> --- a/fs/io_uring.c
>> +++ b/fs/io_uring.c
>> @@ -1212,6 +1212,7 @@ static int io_prep_rw(struct io_kiocb *req, const struct sqe_submit *s,
>>  
>>  		kiocb->ki_flags |= IOCB_HIPRI;
>>  		kiocb->ki_complete = io_complete_rw_iopoll;
>> +		req->result = 0;
>>  	} else {
>>  		if (kiocb->ki_flags & IOCB_HIPRI)
>>  			return -EINVAL;
>>
> I checked out a few configs and didn't hit the error. However, with 16
> poll queues and 32 fio jobs, I eventually started getting a bunch of
> nvme timeout/abort messages and the system seemed to be stuck in a
> poll loop. Killing the test didn't work and the system was hung:
>
> ...
>
> [ 407.978329] nvme nvme0: I/O 679 QID 30 timeout, aborting
> [ 408.085318] nvme nvme0: I/O 680 QID 30 timeout, aborting
> [ 408.164145] nvme nvme0: I/O 682 QID 30 timeout, aborting
> [ 408.238102] nvme nvme0: I/O 683 QID 30 timeout, aborting
> [ 412.970712] nvme nvme0: Abort status: 0x0r=239k IOPS][eta 01m:30s]
> [ 413.018696] nvme nvme0: Abort status: 0x0
> [ 413.066691] nvme nvme0: Abort status: 0x0
> [ 413.114684] nvme nvme0: Abort status: 0x0
> [ 438.035324] nvme nvme0: I/O 674 QID 30 timeout, reset controllers]
> [ 444.287637] nvme nvme0: 15/0/16 default/read/poll queues
> [ 637.073111] INFO: task kworker/u66:0:55 blocked for more than 122
> seconds.
>
> ...
>
> fio: terminating on signal 2
> Jobs: 32 (f=32): [r(32)][0.3%][eta 02d:01h:28m:36s]

Right, there's some funkiness going on on the nvme side with shared poll
queues, I'm looking into it right now. The patch I sent only takes care
of the "submit after -EAGAIN was already received" case that you
reported.

-- 
Jens Axboe
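For readers following the thread, below is a minimal, compilable userspace
sketch of the submission-side flow being debated above: a fatal submit error
returns ret != 0 and ends the request, while on the bio_wouldblock_error()
path io_complete_rw_iopoll() stores -EAGAIN in req->result inline, which
__io_submit_sqe() then checks to decide whether to retry. All *_sketch
identifiers are invented stand-ins; only the names quoted from the thread
correspond to real io_uring code, and this is not the actual fs/io_uring.c
implementation.

#include <errno.h>
#include <stdio.h>

struct req_sketch {
	long result;		/* filled in by the "completion" callback */
};

/* stand-in for io_complete_rw_iopoll(), called off bio_wouldblock_error() */
static void complete_rw_iopoll_sketch(struct req_sketch *req, long res)
{
	req->result = res;
}

/* stand-in for io_read(); pretend the poll queue was full this time */
static int io_read_sketch(struct req_sketch *req)
{
	complete_rw_iopoll_sketch(req, -EAGAIN); /* happens inline, before return */
	return 0;				 /* not a fatal submission error */
}

/* stand-in for the tail end of __io_submit_sqe() in the IOPOLL case */
static int submit_sqe_sketch(struct req_sketch *req)
{
	int ret;

	req->result = 0;	/* the initialization the patch adds in io_prep_rw() */

	ret = io_read_sketch(req);
	if (ret)
		return ret;	/* fatal error, just end the request */

	if (req->result == -EAGAIN)
		return -EAGAIN;	/* wouldblock: the caller must retry the submission */

	/*
	 * Submitted for real.  Some other task polling the same hardware
	 * queue may complete the request at any time, which is fine since
	 * a reference is held; this is where io_iopoll_req_issued() runs.
	 */
	return 0;
}

int main(void)
{
	struct req_sketch req;

	/* prints -EAGAIN (-11 on Linux) for this simulated wouldblock case */
	printf("submit returned %d\n", submit_sqe_sketch(&req));
	return 0;
}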