IORING_OP_SPLICE has problems, many of them fundamental and rooted in the
uapi design; see the description of patch 8. This patchset introduces a
different approach, which came out of discussions about splices and fused
commands and absorbs ideas from both. We remove the reliance on pipes and
instead register "spliced" buffers with data as io_uring registered
buffers. The user can then use one as a usual registered buffer, e.g. pass
it to IORING_OP_WRITE_FIXED. Once a buffer is released, it will be
returned back to the file it originated from via a callback. This is
carried out at the level of the entire buffer rather than per-page as
with splice, which, as noted by Ming, will allow more optimisations.

The communication with the target file is done by a new fops callback,
however the end means of getting a buffer might change. It also peels off
layers of code compared to splice requests, which helps it to be more
flexible and support more cases. For instance, Ming has a case where it's
beneficial for the target file to provide a buffer to be filled with
read/recv/etc. requests and then returned back to the file.

Testing: I benchmarked it using liburing/examples/splice-bench.t [1],
which also needs additional test kernel patches [2]. It implements
get-buf for /dev/null, and the test grabs one page from it, feeds it
back without any actual IO, then repeats.

For fairness: IORING_OP_SPLICE performs very poorly, not even reaching
450K qps, so one of the patches enables inline execution of it to make
the comparison more interesting, but that is only fine for testing.
Buffer removal is done by OP_GET_BUF without issuing a separate op for
that; "GET_BUF + nop" emulates that overhead with additional nop
requests. Another aspect is that the OP_GET_BUF test issues
OP_WRITE_FIXED, which, as profiles show, is quite expensive; that is not
exactly a problem of GET_BUF but it skews the results. E.g.
io_get_buf() - 10.7%, io_write() - 24.3%

The last bit is that buffer removal, if done by a separate request, might
and likely will be batched with other requests, so "GET_BUF + nop" is
rather the worst case. The numbers below are requests / s.

QD | splice2() | OP_SPLICE | OP_GET_BUF | GET_BUF, link | GET_BUF + nop
1  | 5009035   | 3697020   | 3886356    | 4616123       | 2886171
2  | 4859523   | 5205564   | 5309510    | 5591521       | 4139125
4  | 4908353   | 6265771   | 6415036    | 6331249       | 5198505
8  | 4955003   | 7141326   | 7243434    | 6850088       | 5984588
16 | 4959496   | 7640409   | 7794564    | 7208221       | 6587212
32 | 4937463   | 7868501   | 8103406    | 7385890       | 6844390

The test is obviously not exhaustive, and the approach should further be
tried with more complicated cases. E.g. we need to quantify performance
with sockets, where the apoll feature will be involved, and it'll need
internal partial IO retry support.

[1] https://github.com/isilence/liburing.git io_uring/get-buf-op
[2] https://github.com/isilence/linux.git io_uring/get-buf-op

Links for convenience:

https://github.com/isilence/liburing/tree/io_uring/get-buf-op
https://github.com/isilence/linux/tree/io_uring/get-buf-op

Pavel Begunkov (7):
  io_uring: add io_mapped_ubuf caches
  io_uring: add reg-buffer data directions
  io_uring: fail loop_rw_iter with pure bvec bufs
  io_uring/rsrc: introduce struct iou_buf_desc
  io_uring/rsrc: add buffer release callbacks
  io_uring/rsrc: introduce helper installing one buffer
  io_uring,fs: introduce IORING_OP_GET_BUF

 include/linux/fs.h             |  2 +
 include/linux/io_uring.h       | 19 +++++++
 include/linux/io_uring_types.h |  2 +
 include/uapi/linux/io_uring.h  |  1 +
 io_uring/io_uring.c            |  9 ++++
 io_uring/opdef.c               | 11 +++++
 io_uring/rsrc.c                | 80 ++++++++++++++++++++++++++----
 io_uring/rsrc.h                | 24 +++++++--
 io_uring/rw.c                  |  7 +++
 io_uring/splice.c              | 90 ++++++++++++++++++++++++++++++++++
 io_uring/splice.h              |  4 ++
 11 files changed, 235 insertions(+), 14 deletions(-)

-- 
2.40.0
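P.S. For illustration only, a pseudocode-level sketch of the intended
userspace flow in liburing style. IORING_OP_GET_BUF, its sqe field usage,
and the buffer-slot convention are not settled uapi; everything below
other than the stock liburing helpers is an assumption based on this
cover letter and may not match the test branches exactly.

```c
/* Hypothetical sketch, not working code: pull a buffer out of src_fd,
 * then write it out as a registered buffer, with the two SQEs linked.
 * IORING_OP_GET_BUF and the meaning of buf_index here are assumptions. */
struct io_uring ring;
io_uring_queue_init(8, &ring, 0);

struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
/* GET_BUF: ask src_fd to hand over a buffer and install it as
 * registered buffer slot 0 (assumed convention). */
io_uring_prep_rw(IORING_OP_GET_BUF, sqe, src_fd, NULL, 0, 0);
sqe->buf_index = 0;
sqe->flags |= IOSQE_IO_LINK;    /* run the write only after GET_BUF */

/* Use the installed buffer like any other registered buffer. */
sqe = io_uring_get_sqe(&ring);
io_uring_prep_write_fixed(sqe, dst_fd, /* addr within reg buf */ NULL,
                          len, 0, /* buf_index */ 0);

io_uring_submit(&ring);
/* Releasing or replacing the registered buffer hands it back to src_fd
 * via the new buffer release callback. */
```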