On 5/28/24 10:23 AM, Pavel Begunkov wrote:
> On 5/28/24 15:23, Jens Axboe wrote:
>> On 5/28/24 7:32 AM, Pavel Begunkov wrote:
>>> On 5/24/24 23:58, Jens Axboe wrote:
>>>> If IORING_SETUP_SINGLE_ISSUER is set, then we can't post CQEs remotely
>>>> to the target ring. Instead, task_work is queued for the target ring,
>>>> which is used to post the CQE. To make matters worse, once the target
>>>> CQE has been posted, task_work is then queued with the originator to
>>>> fill the completion.
>>>>
>>>> This obviously adds a bunch of overhead and latency. Instead of relying
>>>> on generic kernel task_work for this, fill an overflow entry on the
>>>> target ring and flag it as such that the target ring will flush it. This
>>>> avoids both the task_work for posting the CQE, and it means that the
>>>> originator CQE can be filled inline as well.
>>>>
>>>> In local testing, this reduces the latency on the sender side by 5-6x.
>>>>
>>>> Signed-off-by: Jens Axboe <axboe@xxxxxxxxx>
>>>> ---
>>>>   io_uring/msg_ring.c | 77 +++++++++++++++++++++++++++++++++++++++++++--
>>>>   1 file changed, 74 insertions(+), 3 deletions(-)
>>>>
>>>> diff --git a/io_uring/msg_ring.c b/io_uring/msg_ring.c
>>>> index feff2b0822cf..3f89ff3a40ad 100644
>>>> --- a/io_uring/msg_ring.c
>>>> +++ b/io_uring/msg_ring.c
>>>> @@ -123,6 +123,69 @@ static void io_msg_tw_complete(struct callback_head *head)
>>>>   	io_req_queue_tw_complete(req, ret);
>>>>   }
>>>>
>>>> +static struct io_overflow_cqe *io_alloc_overflow(struct io_ring_ctx *target_ctx)
>>>> +{
>>>> +	bool is_cqe32 = target_ctx->flags & IORING_SETUP_CQE32;
>>>> +	size_t cqe_size = sizeof(struct io_overflow_cqe);
>>>> +	struct io_overflow_cqe *ocqe;
>>>> +
>>>> +	if (is_cqe32)
>>>> +		cqe_size += sizeof(struct io_uring_cqe);
>>>> +
>>>> +	ocqe = kmalloc(cqe_size, GFP_ATOMIC | __GFP_ACCOUNT);
>>>
>>> __GFP_ACCOUNT looks painful
>>
>> It always is - I did add the usual alloc cache for this after posting
>> this series, which makes it a no-op basically:
>
> Simple ring private cache wouldn't work so well with non
> uniform transfer distributions. One way messaging, userspace
> level batching, etc., but the main question is in the other
> email, i.e. maybe it's better to go with the 2 tw hop model,
> which returns memory back where it came from.

The cache is local to the ring, so anyone that sends messages to that
ring gets to use it. So I believe it should in fact work really well.
If messaging is bidirectional, then caching on the target will apply
in both directions.

-- 
Jens Axboe
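
[Editor's note: for reference, below is a minimal sketch of the kind of
per-ring cache being discussed, where the target ring keeps a small
freelist of overflow entries so the GFP_ATOMIC | __GFP_ACCOUNT allocation
is only taken on a cache miss. All names here (msg_cache, ovf_entry,
msg_cache_get, msg_cache_put) are hypothetical, and the code is plain
userspace C using malloc(); it is not the actual alloc-cache code from
the follow-up series.]

/*
 * Illustrative sketch only - hypothetical names, not kernel code.
 * Each ring keeps a small freelist of overflow entries; senders
 * targeting that ring reuse entries from it, so a fresh allocation
 * (kmalloc(GFP_ATOMIC | __GFP_ACCOUNT) in the kernel, malloc() here)
 * is only needed when the cache is empty.
 */
#include <stdlib.h>

struct ovf_entry {
	struct ovf_entry *next;
	/* payload: the CQE(s) to be flushed on the target ring */
};

struct msg_cache {
	struct ovf_entry *free_list;	/* would be protected by the ring's completion lock */
	unsigned int nr_free;
	unsigned int max_free;		/* cap so the cache can't grow unbounded */
};

static struct ovf_entry *msg_cache_get(struct msg_cache *cache)
{
	struct ovf_entry *entry = cache->free_list;

	if (entry) {
		cache->free_list = entry->next;
		cache->nr_free--;
		return entry;
	}
	/* cache miss: fall back to a real allocation */
	return malloc(sizeof(*entry));
}

static void msg_cache_put(struct msg_cache *cache, struct ovf_entry *entry)
{
	/* return the entry to the target ring's cache once it has been flushed */
	if (cache->nr_free < cache->max_free) {
		entry->next = cache->free_list;
		cache->free_list = entry;
		cache->nr_free++;
	} else {
		free(entry);
	}
}

[Because the freelist belongs to the target ring, every sender posting to
that ring reuses the same pool, which is the point made above; with
bidirectional messaging, each side's ring ends up caching the entries its
peers use.]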