On 5/17/22 8:18 AM, Hao Xu wrote:
> Hi All,
> 
> On 5/17/22 00:21, Jens Axboe wrote:
>> Provided buffers allow an application to supply io_uring with buffers
>> that can then be grabbed for a read/receive request, when the data
>> source is ready to deliver data. The existing scheme relies on using
>> IORING_OP_PROVIDE_BUFFERS to do that, but it can be difficult to use
>> in real world applications. It's pretty efficient if the application
>> is able to supply back batches of provided buffers when they have been
>> consumed and the application is ready to recycle them, but if
>> fragmentation occurs in the buffer space, it can become difficult to
>> supply enough buffers at a time. This hurts efficiency.
>>
>> Add a register op, IORING_REGISTER_PBUF_RING, which allows an application
>> to setup a shared queue for each buffer group of provided buffers. The
>> application can then supply buffers simply by adding them to this ring,
>> and the kernel can consume them just as easily. The ring shares the head
>> with the application, the tail remains private in the kernel.
>>
>> Provided buffers setup with IORING_REGISTER_PBUF_RING cannot use
>> IORING_OP_{PROVIDE,REMOVE}_BUFFERS for adding or removing entries to the
>> ring, they must use the mapped ring. Mapped provided buffer rings can
>> co-exist with normal provided buffers, just not within the same group ID.
>>
>> To gauge overhead of the existing scheme and evaluate the mapped ring
>> approach, a simple NOP benchmark was written. It uses a ring of 128
>> entries, and submits/completes 32 at a time. 'Replenish' is how
>> many buffers are provided back at a time after they have been
>> consumed:
>>
>> Test                    Replenish   NOPs/sec
>> ================================================================
>> No provided buffers     NA          ~30M
>> Provided buffers        32          ~16M
>> Provided buffers        1           ~10M
>> Ring buffers            32          ~27M
>> Ring buffers            1           ~27M
>>
>> The ring mapped buffers perform almost as well as not using provided
>> buffers at all, and they don't care if you provide 1 or more back at
>> the same time. This means applications can just replenish as they go,
>> rather than needing to batch and compact, further reducing overhead in
>> the application. The NOP benchmark above doesn't need to do any
>> compaction, so that overhead isn't even reflected in the above test.
>>
>> Co-developed-by: Dylan Yudaken <dylany@xxxxxx>
>> Signed-off-by: Jens Axboe <axboe@xxxxxxxxx>
>> ---
>>  fs/io_uring.c                 | 233 ++++++++++++++++++++++++++++++++--
>>  include/uapi/linux/io_uring.h |  36 ++++++
>>  2 files changed, 257 insertions(+), 12 deletions(-)
>>
>> diff --git a/fs/io_uring.c b/fs/io_uring.c
>> index 5867dcabc73b..776a9f5e5ec7 100644
>> --- a/fs/io_uring.c
>> +++ b/fs/io_uring.c
>> @@ -285,9 +285,26 @@ struct io_rsrc_data {
>>  	bool				quiesce;
>>  };
>> +#define IO_BUFFER_LIST_BUF_PER_PAGE (PAGE_SIZE / sizeof(struct io_uring_buf))
>>  struct io_buffer_list {
>> -	struct list_head buf_list;
>> +	/*
>> +	 * If ->buf_nr_pages is set, then buf_pages/buf_ring are used. If not,
>> +	 * then these are classic provided buffers and ->buf_list is used.
>> +	 */
>> +	union {
>> +		struct list_head buf_list;
>> +		struct {
>> +			struct page **buf_pages;
>> +			struct io_uring_buf_ring *buf_ring;
>> +		};
>> +	};
>>  	__u16 bgid;
>> +
>> +	/* below is for ring provided buffers */
>> +	__u16 buf_nr_pages;
>> +	__u16 nr_entries;
>> +	__u32 tail;
>> +	__u32 mask;
>>  };
>>  struct io_buffer {
>> @@ -804,6 +821,7 @@ enum {
>>  	REQ_F_NEED_CLEANUP_BIT,
>>  	REQ_F_POLLED_BIT,
>>  	REQ_F_BUFFER_SELECTED_BIT,
>> +	REQ_F_BUFFER_RING_BIT,
>>  	REQ_F_COMPLETE_INLINE_BIT,
>>  	REQ_F_REISSUE_BIT,
>>  	REQ_F_CREDS_BIT,
>> @@ -855,6 +873,8 @@ enum {
>>  	REQ_F_POLLED		= BIT(REQ_F_POLLED_BIT),
>>  	/* buffer already selected */
>>  	REQ_F_BUFFER_SELECTED	= BIT(REQ_F_BUFFER_SELECTED_BIT),
>> +	/* buffer selected from ring, needs commit */
>> +	REQ_F_BUFFER_RING	= BIT(REQ_F_BUFFER_RING_BIT),
>>  	/* completion is deferred through io_comp_state */
>>  	REQ_F_COMPLETE_INLINE	= BIT(REQ_F_COMPLETE_INLINE_BIT),
>>  	/* caller should reissue async */
>> @@ -979,6 +999,12 @@ struct io_kiocb {
>>  		/* stores selected buf, valid IFF REQ_F_BUFFER_SELECTED is set */
>>  		struct io_buffer	*kbuf;
>> +
>> +		/*
>> +		 * stores buffer ID for ring provided buffers, valid IFF
>> +		 * REQ_F_BUFFER_RING is set.
>> +		 */
>> +		struct io_buffer_list	*buf_list;
>>  	};
>>  	union {
>> @@ -1470,8 +1496,14 @@ static inline void io_req_set_rsrc_node(struct io_kiocb *req,
>>  static unsigned int __io_put_kbuf(struct io_kiocb *req, struct list_head *list)
>>  {
>> -	req->flags &= ~REQ_F_BUFFER_SELECTED;
>> -	list_add(&req->kbuf->list, list);
>> +	if (req->flags & REQ_F_BUFFER_RING) {
>> +		if (req->buf_list)
>> +			req->buf_list->tail++;
> 
> This confused me for some time... it seems [tail, head) is the range of
> registered bufs that kernel space can leverage? Similar to what the pipe
> logic does. How about swapping the names of head and tail, so that the
> kernel is the consumer? But this is just my personal preference..

No, I agree, I'll make that change. That matches the SQ ring as well,
which is the same user-producer, kernel-consumer setup.

>> +	tail &= bl->mask;
>> +	if (tail < IO_BUFFER_LIST_BUF_PER_PAGE) {
>> +		buf = &br->bufs[tail];
>> +	} else {
>> +		int off = tail & (IO_BUFFER_LIST_BUF_PER_PAGE - 1);
>> +		int index = tail / IO_BUFFER_LIST_BUF_PER_PAGE - 1;
> 
> Could we do some bitwise trick with a compile-time check there, since for
> now IO_BUFFER_LIST_BUF_PER_PAGE is a power of 2?

This is known at compile time, so the compiler should already be doing
that, as it's a constant.

>> +		buf = page_address(bl->buf_pages[index]);
>> +		buf += off;
>> +	}
> 
> I'm not familiar with this part, so allow me to ask: is this if/else
> statement for efficiency? Why choose one page as the dividing line?

We need to index at the right page granularity.

-- 
Jens Axboe
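
To make the producer/consumer flow above concrete, here is a rough
userspace sketch of replenishing a single buffer into a mapped ring. The
uapi additions (struct io_uring_buf / io_uring_buf_ring) are not in the
quoted hunks, so the struct layout, the field names, the power-of-2 ring
size, and the shared index being called "head" (per this revision, before
the rename agreed above) are all assumptions, not the final ABI:

/*
 * Hypothetical sketch only: example_buf_ring is an assumed layout, not
 * the real uapi. The application is the producer and advances the shared
 * index; the kernel consumes with its own private index.
 */
#include <linux/types.h>

struct example_buf_ring {
	__u32 head;		/* shared with the kernel, app-advanced */
	__u32 ring_entries;	/* assumed power of 2, as bl->mask suggests */
	struct {
		__u64 addr;
		__u32 len;
		__u16 bid;
		__u16 resv;
	} bufs[];
};

static void example_add_buf(struct example_buf_ring *br, void *addr,
			    unsigned int len, unsigned short bid)
{
	unsigned int idx = br->head & (br->ring_entries - 1);

	br->bufs[idx].addr = (__u64)(unsigned long)addr;
	br->bufs[idx].len = len;
	br->bufs[idx].bid = bid;
	/* publish the entry before the kernel can observe the new head */
	__atomic_store_n(&br->head, br->head + 1, __ATOMIC_RELEASE);
}

The point the benchmark makes is that the application can call something
like this for each buffer as it finishes with it, instead of batching and
compacting IORING_OP_PROVIDE_BUFFERS submissions.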
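
On the bitwise-trick question, a minimal standalone example of what the
compiler already does with a power-of-2 compile-time constant.
BUFS_PER_PAGE is just a stand-in for IO_BUFFER_LIST_BUF_PER_PAGE,
assuming 4KB pages and 16-byte entries (4096 / 16 = 256):

/*
 * Stand-in demo: with a power-of-2 constant divisor, gcc/clang compile
 * the / and % below to a shift and a mask, so no explicit bit tricks
 * are needed in the source.
 */
#include <stdio.h>

#define BUFS_PER_PAGE 256	/* stand-in for IO_BUFFER_LIST_BUF_PER_PAGE */

int main(void)
{
	unsigned int tail = 700;

	unsigned int index = tail / BUFS_PER_PAGE;	/* emitted as tail >> 8 */
	unsigned int off   = tail % BUFS_PER_PAGE;	/* emitted as tail & 255 */

	printf("index=%u off=%u\n", index, off);
	return 0;
}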
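
On the one-page dividing line: the ring is backed by individually pinned
pages (bl->buf_pages), and those pages are not guaranteed to be contiguous
in the kernel's address space, so an entry index beyond the first page has
to be split into a page lookup plus an offset within that page. A
standalone illustration of that split, using assumed sizes (4KB pages,
16-byte entries) and ignoring the patch's exact first-page bookkeeping
(the "- 1" in the index calculation):

/*
 * Illustration only: separately allocated "pages" stand in for pinned
 * user pages that are not contiguous in the kernel mapping, which is
 * why the lookup cannot simply index past the first page.
 */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define ENTRY_SIZE	16	/* assumed sizeof(struct io_uring_buf) */
#define FAKE_PAGE_SIZE	4096
#define BUFS_PER_PAGE	(FAKE_PAGE_SIZE / ENTRY_SIZE)	/* 256 */

int main(void)
{
	unsigned char *pages[2];	/* a 512-entry ring spans two pages */
	unsigned int tail = 300;	/* already masked ring index */
	unsigned int page_idx, slot;
	unsigned char *entry;

	pages[0] = malloc(FAKE_PAGE_SIZE);
	pages[1] = malloc(FAKE_PAGE_SIZE);
	if (!pages[0] || !pages[1])
		return 1;

	page_idx = tail / BUFS_PER_PAGE;	/* 1: which page holds the entry */
	slot     = tail % BUFS_PER_PAGE;	/* 44: slot within that page */

	entry = pages[page_idx] + slot * ENTRY_SIZE;
	memset(entry, 0, ENTRY_SIZE);		/* touch the slot found by the lookup */

	printf("entry %u -> page %u, byte offset %u\n",
	       tail, page_idx, slot * ENTRY_SIZE);

	free(pages[0]);
	free(pages[1]);
	return 0;
}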