On 5/17/22 8:18 AM, Hao Xu wrote:
> Hi All,
> 
> On 5/17/22 00:21, Jens Axboe wrote:
>> Provided buffers allow an application to supply io_uring with buffers
>> that can then be grabbed for a read/receive request, when the data
>> source is ready to deliver data. The existing scheme relies on using
>> IORING_OP_PROVIDE_BUFFERS to do that, but it can be difficult to use
>> in real world applications. It's pretty efficient if the application
>> is able to supply back batches of provided buffers when they have been
>> consumed and the application is ready to recycle them, but if
>> fragmentation occurs in the buffer space, it can become difficult to
>> supply enough buffers at a time. This hurts efficiency.
>>
>> Add a register op, IORING_REGISTER_PBUF_RING, which allows an application
>> to setup a shared queue for each buffer group of provided buffers. The
>> application can then supply buffers simply by adding them to this ring,
>> and the kernel can consume them just as easily. The ring shares the head
>> with the application, the tail remains private in the kernel.
>>
>> Provided buffers setup with IORING_REGISTER_PBUF_RING cannot use
>> IORING_OP_{PROVIDE,REMOVE}_BUFFERS for adding or removing entries to the
>> ring, they must use the mapped ring. Mapped provided buffer rings can
>> co-exist with normal provided buffers, just not within the same group ID.
>>
>> To gauge overhead of the existing scheme and evaluate the mapped ring
>> approach, a simple NOP benchmark was written. It uses a ring of 128
>> entries, and submits/completes 32 at a time. 'Replenish' is how
>> many buffers are provided back at a time after they have been
>> consumed:
>>
>> Test                    Replenish   NOPs/sec
>> ================================================================
>> No provided buffers     NA          ~30M
>> Provided buffers        32          ~16M
>> Provided buffers        1           ~10M
>> Ring buffers            32          ~27M
>> Ring buffers            1           ~27M
>>
>> The ring mapped buffers perform almost as well as not using provided
>> buffers at all, and they don't care if you provide 1 or more back at
>> the same time. This means applications can just replenish as they go,
>> rather than needing to batch and compact, further reducing overhead in
>> the application. The NOP benchmark above doesn't need to do any
>> compaction, so that overhead isn't even reflected in the above test.
>>
>> Co-developed-by: Dylan Yudaken <dylany@xxxxxx>
>> Signed-off-by: Jens Axboe <axboe@xxxxxxxxx>
>> ---
>>  fs/io_uring.c                 | 233 ++++++++++++++++++++++++++++++++--
>>  include/uapi/linux/io_uring.h |  36 ++++++
>>  2 files changed, 257 insertions(+), 12 deletions(-)
>>
>> diff --git a/fs/io_uring.c b/fs/io_uring.c
>> index 5867dcabc73b..776a9f5e5ec7 100644
>> --- a/fs/io_uring.c
>> +++ b/fs/io_uring.c
>> @@ -285,9 +285,26 @@ struct io_rsrc_data {
>>  	bool				quiesce;
>>  };
>> +#define IO_BUFFER_LIST_BUF_PER_PAGE (PAGE_SIZE / sizeof(struct io_uring_buf))
>>  struct io_buffer_list {
>> -	struct list_head buf_list;
>> +	/*
>> +	 * If ->buf_nr_pages is set, then buf_pages/buf_ring are used. If not,
>> +	 * then these are classic provided buffers and ->buf_list is used.
>> +	 */
>> +	union {
>> +		struct list_head buf_list;
>> +		struct {
>> +			struct page **buf_pages;
>> +			struct io_uring_buf_ring *buf_ring;
>> +		};
>> +	};
>>  	__u16 bgid;
>> +
>> +	/* below is for ring provided buffers */
>> +	__u16 buf_nr_pages;
>> +	__u16 nr_entries;
>> +	__u32 tail;
>> +	__u32 mask;
>>  };
>>  struct io_buffer {
>> @@ -804,6 +821,7 @@ enum {
>>  	REQ_F_NEED_CLEANUP_BIT,
>>  	REQ_F_POLLED_BIT,
>>  	REQ_F_BUFFER_SELECTED_BIT,
>> +	REQ_F_BUFFER_RING_BIT,
>>  	REQ_F_COMPLETE_INLINE_BIT,
>>  	REQ_F_REISSUE_BIT,
>>  	REQ_F_CREDS_BIT,
>> @@ -855,6 +873,8 @@ enum {
>>  	REQ_F_POLLED		= BIT(REQ_F_POLLED_BIT),
>>  	/* buffer already selected */
>>  	REQ_F_BUFFER_SELECTED	= BIT(REQ_F_BUFFER_SELECTED_BIT),
>> +	/* buffer selected from ring, needs commit */
>> +	REQ_F_BUFFER_RING	= BIT(REQ_F_BUFFER_RING_BIT),
>>  	/* completion is deferred through io_comp_state */
>>  	REQ_F_COMPLETE_INLINE	= BIT(REQ_F_COMPLETE_INLINE_BIT),
>>  	/* caller should reissue async */
>> @@ -979,6 +999,12 @@ struct io_kiocb {
>>  		/* stores selected buf, valid IFF REQ_F_BUFFER_SELECTED is set */
>>  		struct io_buffer	*kbuf;
>> +
>> +		/*
>> +		 * stores buffer ID for ring provided buffers, valid IFF
>> +		 * REQ_F_BUFFER_RING is set.
>> +		 */
>> +		struct io_buffer_list	*buf_list;
>>  	};
>>  	union {
>> @@ -1470,8 +1496,14 @@ static inline void io_req_set_rsrc_node(struct io_kiocb *req,
>>  static unsigned int __io_put_kbuf(struct io_kiocb *req, struct list_head *list)
>>  {
>> -	req->flags &= ~REQ_F_BUFFER_SELECTED;
>> -	list_add(&req->kbuf->list, list);
>> +	if (req->flags & REQ_F_BUFFER_RING) {
>> +		if (req->buf_list)
>> +			req->buf_list->tail++;
> 
> This confused me for some time... it seems [tail, head) is the range of
> registered bufs that kernel space can leverage? Similar to what the pipe
> logic does. How about swapping the names of head and tail, so that the
> kernel is the consumer? But this is just my personal preference..

No, I agree, I'll make that change. That matches the SQ ring as well,
which is the same user-producer, kernel-consumer setup.

>> +	tail &= bl->mask;
>> +	if (tail < IO_BUFFER_LIST_BUF_PER_PAGE) {
>> +		buf = &br->bufs[tail];
>> +	} else {
>> +		int off = tail & (IO_BUFFER_LIST_BUF_PER_PAGE - 1);
>> +		int index = tail / IO_BUFFER_LIST_BUF_PER_PAGE - 1;
> 
> Could we do some bitwise trick with a compile-time check there, since for
> now IO_BUFFER_LIST_BUF_PER_PAGE is a power of 2?

This is known at compile time, so the compiler should already be doing
that, as it's a constant.

>> +		buf = page_address(bl->buf_pages[index]);
>> +		buf += off;
>> +	}
> 
> I'm not familiar with this part, so allow me to ask: is this if/else
> statement for efficiency? Why choose one page as the dividing line?

We need to index at the right page granularity.

-- 
Jens Axboe
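
To make the producer/consumer flow above concrete, here is a rough
userspace sketch of replenishing a single buffer into a mapped ring. The
uapi additions (struct io_uring_buf / io_uring_buf_ring) are not in the
quoted hunks, so the struct layout, the field names, the power-of-2 ring
size, and the shared index being called "head" (per this revision, before
the rename agreed above) are all assumptions, not the final ABI:

/*
 * Hypothetical sketch only: example_buf_ring is an assumed layout, not
 * the real uapi. The application is the producer and advances the shared
 * index; the kernel consumes with its own private index.
 */
#include <linux/types.h>

struct example_buf_ring {
	__u32 head;		/* shared with the kernel, app-advanced */
	__u32 ring_entries;	/* assumed power of 2, as bl->mask suggests */
	struct {
		__u64 addr;
		__u32 len;
		__u16 bid;
		__u16 resv;
	} bufs[];
};

static void example_add_buf(struct example_buf_ring *br, void *addr,
			    unsigned int len, unsigned short bid)
{
	unsigned int idx = br->head & (br->ring_entries - 1);

	br->bufs[idx].addr = (__u64)(unsigned long)addr;
	br->bufs[idx].len = len;
	br->bufs[idx].bid = bid;
	/* publish the entry before the kernel can observe the new head */
	__atomic_store_n(&br->head, br->head + 1, __ATOMIC_RELEASE);
}

The point the benchmark makes is that the application can call something
like this for each buffer as it finishes with it, instead of batching and
compacting IORING_OP_PROVIDE_BUFFERS submissions.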
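
On the bitwise-trick question, a minimal standalone example of what the
compiler already does with a power-of-2 compile-time constant.
BUFS_PER_PAGE is just a stand-in for IO_BUFFER_LIST_BUF_PER_PAGE,
assuming 4KB pages and 16-byte entries (4096 / 16 = 256):

/*
 * Stand-in demo: with a power-of-2 constant divisor, gcc/clang compile
 * the / and % below to a shift and a mask, so no explicit bit tricks
 * are needed in the source.
 */
#include <stdio.h>

#define BUFS_PER_PAGE 256	/* stand-in for IO_BUFFER_LIST_BUF_PER_PAGE */

int main(void)
{
	unsigned int tail = 700;

	unsigned int index = tail / BUFS_PER_PAGE;	/* emitted as tail >> 8 */
	unsigned int off   = tail % BUFS_PER_PAGE;	/* emitted as tail & 255 */

	printf("index=%u off=%u\n", index, off);
	return 0;
}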
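
On the one-page dividing line: the ring is backed by individually pinned
pages (bl->buf_pages), and those pages are not guaranteed to be contiguous
in the kernel's address space, so an entry index beyond the first page has
to be split into a page lookup plus an offset within that page. A
standalone illustration of that split, using assumed sizes (4KB pages,
16-byte entries) and ignoring the patch's exact first-page bookkeeping
(the "- 1" in the index calculation):

/*
 * Illustration only: separately allocated "pages" stand in for pinned
 * user pages that are not contiguous in the kernel mapping, which is
 * why the lookup cannot simply index past the first page.
 */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define ENTRY_SIZE	16	/* assumed sizeof(struct io_uring_buf) */
#define FAKE_PAGE_SIZE	4096
#define BUFS_PER_PAGE	(FAKE_PAGE_SIZE / ENTRY_SIZE)	/* 256 */

int main(void)
{
	unsigned char *pages[2];	/* a 512-entry ring spans two pages */
	unsigned int tail = 300;	/* already masked ring index */
	unsigned int page_idx, slot;
	unsigned char *entry;

	pages[0] = malloc(FAKE_PAGE_SIZE);
	pages[1] = malloc(FAKE_PAGE_SIZE);
	if (!pages[0] || !pages[1])
		return 1;

	page_idx = tail / BUFS_PER_PAGE;	/* 1: which page holds the entry */
	slot     = tail % BUFS_PER_PAGE;	/* 44: slot within that page */

	entry = pages[page_idx] + slot * ENTRY_SIZE;
	memset(entry, 0, ENTRY_SIZE);		/* touch the slot found by the lookup */

	printf("entry %u -> page %u, byte offset %u\n",
	       tail, page_idx, slot * ENTRY_SIZE);

	free(pages[0]);
	free(pages[1]);
	return 0;
}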