[PATCHSET v2 RFC 0/4] Add support for incremental buffer consumption

Jens Axboe <axboe@xxxxxxxxx> · Fri, 23 Aug 2024 08:42:33 -0600

Hi,

The recommended way to use io_uring for networking workloads is to use
ring provided buffers. The application sets up a ring (or several) for
buffers, and puts buffers for receiving data into them. When a recv
completes, the completion contains information on which buffer data was
received into. You can even use bundles with receive, and receive data
into multiple buffers at the same time.

This all works fine, but has some limitations in that a buffer is always
fully consumed. This patchset adds support for partial consumption of
a buffer. This, in turn, allows an application to supply fewer buffers
for receives, but of a much larger size. For example, rather than add
a ton of 1500b buffers for receiving data, the application can just add
one large buffer. Whenever data is received, only the current head part
of the buffer is consumed and used. This leads to less iteration of
buffers, and also eliminates any potential wastage of memory if some
of the receives only partially fill a provided buffer.
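
To put a number on the wastage point, here's a back-of-the-envelope
sketch. The helper name is made up for illustration and is not part of
the patchset; it assumes every receive fits in a single buffer and that
each receive consumes one whole buffer, as is the case today.

```c
#include <stddef.h>

/*
 * Hypothetical helper, not from the patchset: bytes left unused when
 * every receive consumes a whole buffer of 'buf_size' bytes.
 * Assumes each lens[i] <= buf_size.
 */
static size_t wasted_bytes_full_consume(const size_t *lens, size_t n,
					size_t buf_size)
{
	size_t wasted = 0;

	for (size_t i = 0; i < n; i++)
		wasted += buf_size - lens[i];
	return wasted;
}
```

For receives of 100, 1500, and 700 bytes into 1500b buffers, that comes
to 1400 + 0 + 800 = 2200 unused bytes. With one large incrementally
consumed buffer, each receive starts where the previous one ended, so
there are no such per-receive gaps.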

Patchset is lightly tested, passes current tests and also the new test
cases I wrote for it. The liburing 'pbuf-ring-inc' branch has extra
tests and support for this, as well as having examples/proxy support
incrementally consumed buffers.

Using incrementally consumed buffers from an application point of view
is fairly trivial. Just pass the flag IOU_PBUF_RING_INC to
io_uring_setup_buf_ring(), and this marks this buffer group ID as being
incrementally consumed. Outside of that, the application just needs to
keep track of where the current read/recv point is at. See patch 4
for details. Non-incremental buffer completions are always final, in
that any completion will pass back a buffer to the application. For
incrementally consumed buffers, this isn't always the case, as the
kernel may generate more completions for a given buffer ID, if there's
more room left in it. There's a new CQE flag for that,
IORING_CQE_F_BUF_MORE. If set, the application should expect more
completions for this buffer ID.
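
As a rough sketch of what "keeping track of the current read/recv point"
could look like on the application side: the struct and function below
are made up for illustration, and the EX_* constants are local stand-ins
for IORING_CQE_F_BUFFER, IORING_CQE_BUFFER_SHIFT and
IORING_CQE_F_BUF_MORE. The values are my reading of the uapi header;
real code should take them from liburing rather than redefining them.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Local stand-ins (assumed values) for the real uapi constants */
#define EX_CQE_F_BUFFER		(1U << 0)	/* IORING_CQE_F_BUFFER */
#define EX_CQE_F_BUF_MORE	(1U << 4)	/* IORING_CQE_F_BUF_MORE */
#define EX_CQE_BUFFER_SHIFT	16		/* IORING_CQE_BUFFER_SHIFT */

/* Hypothetical per-buffer state kept by the application */
struct inc_buf {
	unsigned char *base;	/* start of the large provided buffer */
	size_t size;		/* total size of the buffer */
	size_t off;		/* current read/recv point */
};

/*
 * Handle one completion given its res and flags. Returns a pointer to
 * the 'res' bytes just received, or NULL if no buffer was used. *done is
 * set to 1 when the kernel is finished with this buffer ID (BUF_MORE
 * not set), 0 if more completions for it should be expected.
 */
static unsigned char *inc_buf_consume(struct inc_buf *bufs, size_t nr_bufs,
				      int32_t res, uint32_t flags, int *done)
{
	uint32_t bid = flags >> EX_CQE_BUFFER_SHIFT;
	struct inc_buf *b;
	unsigned char *data;

	if (res <= 0 || !(flags & EX_CQE_F_BUFFER) || bid >= nr_bufs)
		return NULL;

	b = &bufs[bid];
	assert(b->off + (size_t)res <= b->size);
	data = b->base + b->off;
	b->off += (size_t)res;

	*done = !(flags & EX_CQE_F_BUF_MORE);
	if (*done)
		b->off = 0;	/* start from the head once re-provided */
	return data;
}
```

The application would call this for each CQE with cqe->res and
cqe->flags, and only hand the buffer back to the ring once *done is set.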

Patch 1+2 are just basic prep patches, patch 3 reverts not being able to
set sqe->len for provided buffers for send. With incrementally consumed
buffers, controlling len is important as otherwise it would be very easy
to flood the outgoing socket buffer. Patch 4 is the meat of it, but
still pretty darn simple. Note that this feature ONLY works with ring
provided buffers, not with legacy/classic provided buffers. Code can also
be found here:

https://git.kernel.dk/cgit/linux/log/?h=io_uring-pbuf-partial

and it's based on current -git with the pending 6.12 io_uring patches
pulled in first.

Comments/reviews welcome! I'll add support for this to examples/proxy
in the liburing repo, and can provide some performance results post
that.

 include/uapi/linux/io_uring.h | 18 +++++++++
 io_uring/io_uring.c           |  2 +-
 io_uring/kbuf.c               | 33 +++++++++--------
 io_uring/kbuf.h               | 70 +++++++++++++++++++++++++++--------
 io_uring/net.c                | 12 +++---
 io_uring/rw.c                 |  8 ++--
 6 files changed, 100 insertions(+), 43 deletions(-)

Changes since v1:
- Add IORING_CQE_F_BUF_MORE flag. I originally intended buf->len to be
  used for this purpose, as a len of 0 left obviously means that the
  buffer is done. However, this doesn't work so well. For example, say
  the incremental buffer size is 64K, and a multishot receive first gets
  16K and then 48K. For the first completion, we decrement buf->len, and
  it's now 48K. But we immediately process another recv for this
  request, which is 48K. Now buf->len is zero. The application gets
  both of these completions before looking at buf, hence it will see
  buf->len == 0 for both of them. Adding the BUF_MORE flag lets the
  kernel tell these apart: it's set on completions that leave room in
  the buffer, and clear on the one that actually finished it.
- Fix issue with send side not getting REQ_F_BUFFERS_COMMIT set, hence
  always committing early. This doesn't work for IOBL_INC.
- Allow sqe->len to be set for send + provided buffers. See note above.
- Minor cleanups.
- Move to separate branch.
- Rebase on top of current tree(s).
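
The buf->len problem from the first item above can be modeled with a
small toy simulation. None of this is kernel code; the names are made up
and the 'more' field just stands in for IORING_CQE_F_BUF_MORE.

```c
#include <stdint.h>

/* Toy model of the shared buffer ring entry */
struct sim_buf {
	uint32_t len;		/* space left in the buffer */
};

/* Toy model of a completion */
struct sim_cqe {
	int32_t res;		/* bytes received */
	int more;		/* stand-in for IORING_CQE_F_BUF_MORE */
};

/* "Kernel" side: consume n bytes from the head, post a completion */
static struct sim_cqe sim_recv(struct sim_buf *buf, uint32_t n)
{
	struct sim_cqe cqe;

	if (n > buf->len)
		n = buf->len;
	buf->len -= n;
	cqe.res = (int32_t)n;
	cqe.more = buf->len != 0;
	return cqe;
}
```

With a 64K buffer and receives of 16K then 48K, both completions are
posted before the application looks at anything. At that point buf->len
reads 0 no matter which completion is being handled, so it can't tell
them apart. The per-completion 'more' value can: it's 1 for the first
and 0 for the second.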

-- 
Jens Axboe