IORING_OP_SPLICE has problems, many of them fundamental and rooted in the
uapi design; see the description of patch 8. This patchset introduces a
different approach, which came out of discussions about splices and fused
commands and absorbs ideas from both. We remove the reliance on pipes and
instead register "spliced" buffers with data as io_uring registered
buffers. The user can then use one as a usual registered buffer, e.g. pass
it to IORING_OP_WRITE_FIXED. Once a buffer is released, it will be
returned back to the file it originated from via a callback. This is
carried out at the level of the entire buffer rather than per-page as
with splice, which, as noted by Ming, will allow more optimisations.

The communication with the target file is done by a new fops callback,
however the end means of getting a buffer might change. It also peels off
layers of code compared to splice requests, which helps it to be more
flexible and support more cases. For instance, Ming has a case where it's
beneficial for the target file to provide a buffer to be filled with
read/recv/etc. requests and then returned back to the file.

Testing: I benchmarked it using liburing/examples/splice-bench.t [1],
which also needs additional test kernel patches [2]. It implements
get-buf for /dev/null, and the test grabs one page from it, feeds it
back without any actual IO, then repeats.

For fairness: IORING_OP_SPLICE performs very poorly, not even reaching
450K qps, so one of the patches enables inline execution of it to make
the comparison more interesting, but that is only fine for testing.
Buffer removal is done by OP_GET_BUF without issuing a separate op for
that; "GET_BUF + nop" emulates that overhead with additional nop
requests. Another aspect is that the OP_GET_BUF test issues
OP_WRITE_FIXED, which, as profiles show, is quite expensive; that is not
exactly a problem of GET_BUF but it skews the results. E.g.
io_get_buf() - 10.7%, io_write() - 24.3%

The last bit is that buffer removal, if done by a separate request, might
and likely will be batched with other requests, so "GET_BUF + nop" is
rather the worst case. The numbers below are requests / s.

QD | splice2() | OP_SPLICE | OP_GET_BUF | GET_BUF, link | GET_BUF + nop
1  | 5009035   | 3697020   | 3886356    | 4616123       | 2886171
2  | 4859523   | 5205564   | 5309510    | 5591521       | 4139125
4  | 4908353   | 6265771   | 6415036    | 6331249       | 5198505
8  | 4955003   | 7141326   | 7243434    | 6850088       | 5984588
16 | 4959496   | 7640409   | 7794564    | 7208221       | 6587212
32 | 4937463   | 7868501   | 8103406    | 7385890       | 6844390

The test is obviously not exhaustive, and the approach should further be
tried with more complicated cases. E.g. we need to quantify performance
with sockets, where the apoll feature will be involved, and it'll need
internal partial IO retry support.

[1] https://github.com/isilence/liburing.git io_uring/get-buf-op
[2] https://github.com/isilence/linux.git io_uring/get-buf-op

Links for convenience:

https://github.com/isilence/liburing/tree/io_uring/get-buf-op
https://github.com/isilence/linux/tree/io_uring/get-buf-op

Pavel Begunkov (7):
  io_uring: add io_mapped_ubuf caches
  io_uring: add reg-buffer data directions
  io_uring: fail loop_rw_iter with pure bvec bufs
  io_uring/rsrc: introduce struct iou_buf_desc
  io_uring/rsrc: add buffer release callbacks
  io_uring/rsrc: introduce helper installing one buffer
  io_uring,fs: introduce IORING_OP_GET_BUF

 include/linux/fs.h             |  2 +
 include/linux/io_uring.h       | 19 +++++++
 include/linux/io_uring_types.h |  2 +
 include/uapi/linux/io_uring.h  |  1 +
 io_uring/io_uring.c            |  9 ++++
 io_uring/opdef.c               | 11 +++++
 io_uring/rsrc.c                | 80 ++++++++++++++++++++++++++----
 io_uring/rsrc.h                | 24 +++++++--
 io_uring/rw.c                  |  7 +++
 io_uring/splice.c              | 90 ++++++++++++++++++++++++++++++++++
 io_uring/splice.h              |  4 ++
 11 files changed, 235 insertions(+), 14 deletions(-)

-- 
2.40.0
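P.S. For illustration only, a pseudocode-level sketch of the intended
userspace flow in liburing style. IORING_OP_GET_BUF, its sqe field usage,
and the buffer-slot convention are not settled uapi; everything below
other than the stock liburing helpers is an assumption based on this
cover letter and may not match the test branches exactly.

```c
/* Hypothetical sketch, not working code: pull a buffer out of src_fd,
 * then write it out as a registered buffer, with the two SQEs linked.
 * IORING_OP_GET_BUF and the meaning of buf_index here are assumptions. */
struct io_uring ring;
io_uring_queue_init(8, &ring, 0);

struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
/* GET_BUF: ask src_fd to hand over a buffer and install it as
 * registered buffer slot 0 (assumed convention). */
io_uring_prep_rw(IORING_OP_GET_BUF, sqe, src_fd, NULL, 0, 0);
sqe->buf_index = 0;
sqe->flags |= IOSQE_IO_LINK;    /* run the write only after GET_BUF */

/* Use the installed buffer like any other registered buffer. */
sqe = io_uring_get_sqe(&ring);
io_uring_prep_write_fixed(sqe, dst_fd, /* addr within reg buf */ NULL,
                          len, 0, /* buf_index */ 0);

io_uring_submit(&ring);
/* Releasing or replacing the registered buffer hands it back to src_fd
 * via the new buffer release callback. */
```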