On 2021/4/5 7:07 AM, Jens Axboe wrote:
On 4/3/21 12:58 AM, Hao Xu wrote:
On 2021/4/2 6:29 AM, Pavel Begunkov wrote:
On 01/04/2021 15:55, Hao Xu wrote:
On 2021/4/1 6:25 PM, Pavel Begunkov wrote:
On 01/04/2021 07:53, Hao Xu wrote:
On 2021/4/1 6:06 AM, Pavel Begunkov wrote:
On 31/03/2021 10:01, Hao Xu wrote:
Now that we have multishot poll requests, one sqe can emit multiple
cqes. Consider the example below:
sqe0(multishot poll)-->sqe1-->sqe2(drain req)
sqe2 is meant to be issued only after sqe0 and sqe1 have completed, but
since sqe0 is a multishot poll request, sqe2 may be issued once sqe0's
event has triggered twice, before sqe1 has completed. This isn't what
users leverage drain requests for.
Here a simple solution is to ignore all multishot poll cqes in the drain
accounting, which means drain requests won't wait for those requests to
be done.
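To illustrate the problematic submission order from userspace, here is a
minimal, hypothetical liburing sketch (illustration only, not part of this
patch; it assumes a liburing build that provides
io_uring_prep_poll_multishot(), and the fd and user_data values are made up):

#include <liburing.h>
#include <poll.h>

/* submit sqe0 (multishot poll) -> sqe1 -> sqe2 (drained) on one ring */
static int submit_chain(struct io_uring *ring, int fd)
{
	struct io_uring_sqe *sqe;

	/* sqe0: multishot poll, can post one cqe per readiness event */
	sqe = io_uring_get_sqe(ring);
	io_uring_prep_poll_multishot(sqe, fd, POLLIN);
	sqe->user_data = 0;

	/* sqe1: stand-in for some other, slower request */
	sqe = io_uring_get_sqe(ring);
	io_uring_prep_nop(sqe);
	sqe->user_data = 1;

	/* sqe2: drained; users expect it to run only after sqe0 and sqe1,
	 * but two cqes from sqe0 alone can satisfy the old drain check */
	sqe = io_uring_get_sqe(ring);
	io_uring_prep_nop(sqe);
	sqe->flags |= IOSQE_IO_DRAIN;
	sqe->user_data = 2;

	return io_uring_submit(ring);
}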
Signed-off-by: Hao Xu <haoxu@xxxxxxxxxxxxxxxxx>
---
fs/io_uring.c | 9 +++++++--
1 file changed, 7 insertions(+), 2 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c
index 513096759445..cd6d44cf5940 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -455,6 +455,7 @@ struct io_ring_ctx {
struct callback_head *exit_task_work;
struct wait_queue_head hash_wait;
+ unsigned multishot_cqes;
/* Keep this last, we don't need it for the fast path */
struct work_struct exit_work;
@@ -1181,8 +1182,8 @@ static bool req_need_defer(struct io_kiocb *req, u32 seq)
if (unlikely(req->flags & REQ_F_IO_DRAIN)) {
struct io_ring_ctx *ctx = req->ctx;
- return seq != ctx->cached_cq_tail
- + READ_ONCE(ctx->cached_cq_overflow);
+ return seq + ctx->multishot_cqes != ctx->cached_cq_tail
+ + READ_ONCE(ctx->cached_cq_overflow);
}
return false;
@@ -4897,6 +4898,7 @@ static bool io_poll_complete(struct io_kiocb *req, __poll_t mask, int error)
{
struct io_ring_ctx *ctx = req->ctx;
unsigned flags = IORING_CQE_F_MORE;
+ bool multishot_poll = !(req->poll.events & EPOLLONESHOT);
if (!error && req->poll.canceled) {
error = -ECANCELED;
@@ -4911,6 +4913,9 @@ static bool io_poll_complete(struct io_kiocb *req, __poll_t mask, int error)
req->poll.done = true;
flags = 0;
}
+ if (multishot_poll)
+ ctx->multishot_cqes++;
+
We need to make sure we do that only for a non-final completion, i.e.
not when killing the request, otherwise it'll double-account the last one.
Hi Pavel, I saw that a killing request like iopoll_remove or async_cancel
calls io_cqring_fill_event() to create an -ECANCELED cqe for the original
poll request. So there could be cases like the following (even for a single
poll request):
(1) add poll --> cancel poll, an -ECANCELED cqe.
    1 sqe : 1 cqe, all good
(2) add poll --> trigger event (queued to task_work) --> cancel poll, an -ECANCELED cqe --> task_work runs, another -ECANCELED cqe.
    1 sqe : 2 cqes
Those should emit a CQE on behalf of the request they're cancelling
only when it's definitely cancelled and not going to fill the CQE
itself, e.g. if io_poll_cancel() found it and removed it from
all the lists and the core's poll infra.
At least before multi-cqe it should have been working fine.
I haven't done a test for this, but from the code logic there could be a
case like the one below:
io_poll_add()                       |  io_poll_remove
(event happens) io_poll_wake()      |    io_poll_remove_one
                                    |      io_poll_remove_waitqs
                                    |      io_cqring_fill_event(-ECANCELED)
                                    |
task_work runs (io_poll_task_func)  |
  io_poll_complete()                |
    req->poll.canceled is true,     |
    __io_cqring_fill_event(-ECANCELED)
two -ECANCELED cqes; is there anything I missed?
It definitely may be, but I need to take a closer look.
I'll write a test to check whether this issue exists, and make some changes
if it does.
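For reference, a rough sketch of the kind of reproducer meant here (purely
hypothetical, not the actual test; it assumes an older liburing where
io_uring_prep_poll_remove() takes a void * user_data, and the timing may or
may not hit the race):

#include <liburing.h>
#include <poll.h>
#include <stdio.h>
#include <unistd.h>

#define POLL_UDATA 0x1234

int main(void)
{
	struct io_uring ring;
	struct io_uring_cqe *cqe;
	struct io_uring_sqe *sqe;
	int fds[2], poll_cqes = 0;

	if (io_uring_queue_init(8, &ring, 0) || pipe(fds))
		return 1;

	/* add a single-shot poll on the pipe read end */
	sqe = io_uring_get_sqe(&ring);
	io_uring_prep_poll_add(sqe, fds[0], POLLIN);
	sqe->user_data = POLL_UDATA;
	io_uring_submit(&ring);

	/* trigger the event, then immediately remove the poll, racing the
	 * task_work completion path against io_poll_remove */
	if (write(fds[1], "x", 1) != 1)
		return 1;
	sqe = io_uring_get_sqe(&ring);
	/* note: newer liburing takes a __u64 here instead of void * */
	io_uring_prep_poll_remove(sqe, (void *)POLL_UDATA);
	sqe->user_data = 0xdead;
	io_uring_submit(&ring);

	/* give task_work a chance to run, then count CQEs for the poll
	 * request; more than one would confirm the double completion */
	usleep(10000);
	while (io_uring_peek_cqe(&ring, &cqe) == 0) {
		if (cqe->user_data == POLL_UDATA)
			poll_cqes++;
		io_uring_cqe_seen(&ring, cqe);
	}
	printf("poll cqes for one sqe: %d\n", poll_cqes);
	return 0;
}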
How about something like this? It seems pointless to have an extra
variable for this when we already track whether we're going to do more
completions for this event or not. This also places the variable where
it makes the most sense, and there's plenty of pad space there too.
Warning: totally untested. It would be great if you could test it, and I'm
hoping you're going to send out a v2.
I'm writing a test for it; I will send them together soon.
diff --git a/fs/io_uring.c b/fs/io_uring.c
index f94b32b43429..1eea4998ad9b 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -423,6 +423,7 @@ struct io_ring_ctx {
unsigned cq_mask;
atomic_t cq_timeouts;
unsigned cq_last_tm_flush;
+ unsigned cq_extra;
unsigned long cq_check_overflow;
struct wait_queue_head cq_wait;
struct fasync_struct *cq_fasync;
@@ -1183,8 +1184,8 @@ static bool req_need_defer(struct io_kiocb *req, u32 seq)
if (unlikely(req->flags & REQ_F_IO_DRAIN)) {
struct io_ring_ctx *ctx = req->ctx;
- return seq != ctx->cached_cq_tail
- + READ_ONCE(ctx->cached_cq_overflow);
+ return seq + ctx->cq_extra != ctx->cached_cq_tail
+ + READ_ONCE(ctx->cached_cq_overflow);
}
return false;
@@ -4894,6 +4895,9 @@ static bool io_poll_complete(struct io_kiocb *req, __poll_t mask, int error)
req->poll.done = true;
flags = 0;
}
+ if (flags & IORING_CQE_F_MORE)
+ ctx->cq_extra++;
+
io_commit_cqring(ctx);
return !(flags & IORING_CQE_F_MORE);
}
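For what it's worth, a tiny userspace-style model of the bookkeeping above
(illustration only, not kernel code; overflow handling is left out): each
F_MORE completion bumps cq_extra alongside the CQ tail, so extra multishot
cqes cancel out of the drain comparison.

/* toy model of the drain bookkeeping: cq_tail counts every posted cqe,
 * cq_extra counts cqes that did not retire an sqe (the F_MORE ones) */
struct drain_model {
	unsigned cq_tail;
	unsigned cq_extra;
};

/* post a completion; "more" mirrors IORING_CQE_F_MORE */
static void post_cqe(struct drain_model *m, int more)
{
	m->cq_tail++;
	if (more)
		m->cq_extra++;
}

/* a drain request recorded seq = number of sqes submitted before it;
 * it must keep deferring until that many sqe-retiring cqes were posted */
static int need_defer(const struct drain_model *m, unsigned seq)
{
	return seq + m->cq_extra != m->cq_tail;
}

With that, two F_MORE completions from one multishot poll leave the
comparison untouched, and the drain request still waits for the requests
submitted ahead of it to actually complete.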