Re: [PATCH 2/3] io_uring/msg_ring: avoid double indirection task_work for data messages

Pavel Begunkov <asml.silence@xxxxxxxxx> · Tue, 28 May 2024 17:23:30 +0100

On 5/28/24 15:23, Jens Axboe wrote:
> On 5/28/24 7:32 AM, Pavel Begunkov wrote:
>> On 5/24/24 23:58, Jens Axboe wrote:
>>> If IORING_SETUP_SINGLE_ISSUER is set, then we can't post CQEs remotely
>>> to the target ring. Instead, task_work is queued for the target ring,
>>> which is used to post the CQE. To make matters worse, once the target
>>> CQE has been posted, task_work is then queued with the originator to
>>> fill the completion.
>>>
>>> This obviously adds a bunch of overhead and latency. Instead of relying
>>> on generic kernel task_work for this, fill an overflow entry on the
>>> target ring and flag it as such that the target ring will flush it. This
>>> avoids both the task_work for posting the CQE, and it means that the
>>> originator CQE can be filled inline as well.
>>>
>>> In local testing, this reduces the latency on the sender side by 5-6x.
>>>
>>> Signed-off-by: Jens Axboe <axboe@xxxxxxxxx>
>>> ---
>>>  io_uring/msg_ring.c | 77 +++++++++++++++++++++++++++++++++++++++++++--
>>>  1 file changed, 74 insertions(+), 3 deletions(-)
>>>
>>> diff --git a/io_uring/msg_ring.c b/io_uring/msg_ring.c
>>> index feff2b0822cf..3f89ff3a40ad 100644
>>> --- a/io_uring/msg_ring.c
>>> +++ b/io_uring/msg_ring.c
>>> @@ -123,6 +123,69 @@ static void io_msg_tw_complete(struct callback_head *head)
>>>      io_req_queue_tw_complete(req, ret);
>>>  }
>>>
>>> +static struct io_overflow_cqe *io_alloc_overflow(struct io_ring_ctx *target_ctx)
>>> +{
>>> +    bool is_cqe32 = target_ctx->flags & IORING_SETUP_CQE32;
>>> +    size_t cqe_size = sizeof(struct io_overflow_cqe);
>>> +    struct io_overflow_cqe *ocqe;
>>> +
>>> +    if (is_cqe32)
>>> +        cqe_size += sizeof(struct io_uring_cqe);
>>> +
>>> +    ocqe = kmalloc(cqe_size, GFP_ATOMIC | __GFP_ACCOUNT);
>>
>> __GFP_ACCOUNT looks painful
>
> It always is - I did add the usual alloc cache for this after posting
> this series, which makes it a no-op basically:

A simple ring-private cache wouldn't work so well with non-uniform
transfer distributions: one-way messaging, userspace-level batching,
etc. But the main question is in the other email, i.e. maybe it's
better to go with the 2 tw hop model, which returns memory back to
where it came from.

> https://git.kernel.dk/cgit/linux/commit/?h=io_uring-msg_ring&id=c39ead262b60872d6d7daf55e9fc7d76dc09b29d
>
> Just haven't posted a v2 yet.
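
For readers following the quoted io_alloc_overflow() hunk, the size
computation it performs can be sketched in userspace as below. This is
only an illustration under stated assumptions: the struct names, the
flag value, and overflow_alloc_size() are hypothetical stand-ins, not
the kernel's definitions. It assumes the base CQE is 16 bytes and that
a CQE32 ring needs room for one more CQE-sized chunk per entry, which
is what the extra sizeof(struct io_uring_cqe) in the hunk accounts for.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified userspace stand-ins, NOT the kernel structures. */
struct list_head_sketch {
    struct list_head_sketch *next, *prev;
};

/* Assumed to mirror the 16-byte base CQE layout. */
struct cqe_sketch {
    uint64_t user_data;
    int32_t  res;
    uint32_t flags;
};

struct overflow_cqe_sketch {
    struct list_head_sketch list;   /* linkage on the overflow list */
    struct cqe_sketch cqe;          /* the completion to be flushed */
};

/* Hypothetical flag value, valid for this sketch only. */
#define SKETCH_SETUP_CQE32 (1U << 0)

static_assert(sizeof(struct cqe_sketch) == 16, "base CQE is 16 bytes");

/*
 * Bytes to allocate for one overflow entry: the base entry, plus one
 * extra CQE-sized chunk when the target ring uses 32-byte CQEs.
 */
static size_t overflow_alloc_size(unsigned int ring_flags)
{
    size_t size = sizeof(struct overflow_cqe_sketch);

    if (ring_flags & SKETCH_SETUP_CQE32)
        size += sizeof(struct cqe_sketch);

    return size;
}
```

In this sketch, overflow_alloc_size(SKETCH_SETUP_CQE32) exceeds
overflow_alloc_size(0) by exactly sizeof(struct cqe_sketch), i.e. 16
bytes. The allocation itself (GFP flags, accounting, caching) is what
the thread is debating and is deliberately left out here.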


--
Pavel Begunkov