On 5/28/24 10:23 AM, Pavel Begunkov wrote:
> On 5/28/24 15:23, Jens Axboe wrote:
>> On 5/28/24 7:32 AM, Pavel Begunkov wrote:
>>> On 5/24/24 23:58, Jens Axboe wrote:
>>>> If IORING_SETUP_SINGLE_ISSUER is set, then we can't post CQEs remotely
>>>> to the target ring. Instead, task_work is queued for the target ring,
>>>> which is used to post the CQE. To make matters worse, once the target
>>>> CQE has been posted, task_work is then queued with the originator to
>>>> fill the completion.
>>>>
>>>> This obviously adds a bunch of overhead and latency. Instead of relying
>>>> on generic kernel task_work for this, fill an overflow entry on the
>>>> target ring and flag it as such that the target ring will flush it. This
>>>> avoids both the task_work for posting the CQE, and it means that the
>>>> originator CQE can be filled inline as well.
>>>>
>>>> In local testing, this reduces the latency on the sender side by 5-6x.
>>>>
>>>> Signed-off-by: Jens Axboe <axboe@xxxxxxxxx>
>>>> ---
>>>>   io_uring/msg_ring.c | 77 +++++++++++++++++++++++++++++++++++++++++++--
>>>>   1 file changed, 74 insertions(+), 3 deletions(-)
>>>>
>>>> diff --git a/io_uring/msg_ring.c b/io_uring/msg_ring.c
>>>> index feff2b0822cf..3f89ff3a40ad 100644
>>>> --- a/io_uring/msg_ring.c
>>>> +++ b/io_uring/msg_ring.c
>>>> @@ -123,6 +123,69 @@ static void io_msg_tw_complete(struct callback_head *head)
>>>>   	io_req_queue_tw_complete(req, ret);
>>>>   }
>>>>
>>>> +static struct io_overflow_cqe *io_alloc_overflow(struct io_ring_ctx *target_ctx)
>>>> +{
>>>> +	bool is_cqe32 = target_ctx->flags & IORING_SETUP_CQE32;
>>>> +	size_t cqe_size = sizeof(struct io_overflow_cqe);
>>>> +	struct io_overflow_cqe *ocqe;
>>>> +
>>>> +	if (is_cqe32)
>>>> +		cqe_size += sizeof(struct io_uring_cqe);
>>>> +
>>>> +	ocqe = kmalloc(cqe_size, GFP_ATOMIC | __GFP_ACCOUNT);
>>>
>>> __GFP_ACCOUNT looks painful
>>
>> It always is - I did add the usual alloc cache for this after posting
>> this series, which makes it a no-op basically:
>
> Simple ring private cache wouldn't work so well with non
> uniform transfer distributions. One way messaging, userspace
> level batching, etc., but the main question is in the other
> email, i.e. maybe it's better to go with the 2 tw hop model,
> which returns memory back where it came from.

The cache is local to the ring, so anyone that sends messages to that
ring gets to use it. So I believe it should in fact work really well.
If messaging is bidirectional, then caching on the target will apply
in both directions.

-- 
Jens Axboe
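
[Editor's note: for reference, below is a minimal sketch of the kind of
per-ring cache being discussed, where the target ring keeps a small
freelist of overflow entries so the GFP_ATOMIC | __GFP_ACCOUNT allocation
is only taken on a cache miss. All names here (msg_cache, ovf_entry,
msg_cache_get, msg_cache_put) are hypothetical, and the code is plain
userspace C using malloc(); it is not the actual alloc-cache code from
the follow-up series.]

/*
 * Illustrative sketch only - hypothetical names, not kernel code.
 * Each ring keeps a small freelist of overflow entries; senders
 * targeting that ring reuse entries from it, so a fresh allocation
 * (kmalloc(GFP_ATOMIC | __GFP_ACCOUNT) in the kernel, malloc() here)
 * is only needed when the cache is empty.
 */
#include <stdlib.h>

struct ovf_entry {
	struct ovf_entry *next;
	/* payload: the CQE(s) to be flushed on the target ring */
};

struct msg_cache {
	struct ovf_entry *free_list;	/* would be protected by the ring's completion lock */
	unsigned int nr_free;
	unsigned int max_free;		/* cap so the cache can't grow unbounded */
};

static struct ovf_entry *msg_cache_get(struct msg_cache *cache)
{
	struct ovf_entry *entry = cache->free_list;

	if (entry) {
		cache->free_list = entry->next;
		cache->nr_free--;
		return entry;
	}
	/* cache miss: fall back to a real allocation */
	return malloc(sizeof(*entry));
}

static void msg_cache_put(struct msg_cache *cache, struct ovf_entry *entry)
{
	/* return the entry to the target ring's cache once it has been flushed */
	if (cache->nr_free < cache->max_free) {
		entry->next = cache->free_list;
		cache->free_list = entry;
		cache->nr_free++;
	} else {
		free(entry);
	}
}

[Because the freelist belongs to the target ring, every sender posting to
that ring reuses the same pool, which is the point made above; with
bidirectional messaging, each side's ring ends up caching the entries its
peers use.]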