Re: [PATCH 3/4] io_uring: add IORING_OP_READ[WRITE]_SPLICE_BUF

Ming Lei <ming.lei@xxxxxxxxxx> · Mon, 13 Feb 2023 09:06:03 +0800

On Sat, Feb 11, 2023 at 08:55:23PM -0700, Jens Axboe wrote:
> On 2/11/23 8:22?PM, Ming Lei wrote:
> >>>> Also seems like this should be separately testable. We can't add new
> >>>> opcodes that don't have a feature test at least, and should also have
> >>>> various corner case tests. A bit of commenting outside of this below.
> >>>
> >>> OK, I will write/add one very simple ublk userspace to liburing for
> >>> test purpose.
> >>
> >> Thanks!
> > 
> > Thinking of further, if we use ublk for liburing test purpose, root is
> > often needed, even though we support un-privileged mode, which needs
> > administrator to grant access, so is it still good to do so?
> 
> That's fine, some tests already depend on root for certain things, like
> passthrough. When I run the tests, I do a pass as both a regular user
> and as root. The important bit is just that the tests skip when they are
> not root rather than fail.

OK, that is nice! I am going to write one minimized ublk userspace,
which can be used in both liburing & blktests for test purpose.

> 
> > It could be easier to add ->splice_read() on /dev/zero for test
> > purpose, just allocate zeroed pages in ->splice_read(), and add
> > them to pipe like ublk->splice_read(), and sink side can read
> > from or write to these pages, but zero's read_iter_zero() won't
> > be affected. And normal splice/tee won't connect to zero too
> > because we only allow it from kernel use.
> 
> Arguably /dev/zero should still support splice_read() as a regression
> fix as I argued to Linus, so I'd just add that as a prep patch.

Understood.

> 
> >>>> Seems like this should check for SPLICE_F_FD_IN_FIXED, and also use
> >>>> io_file_get_normal() for the non-fixed case in case someone passed in an
> >>>> io_uring fd.
> >>>
> >>> SPLICE_F_FD_IN_FIXED needs one extra word for holding splice flags, if
> >>> we can use sqe->addr3, I think it is doable.
> >>
> >> I haven't checked the rest, but you can't just use ->splice_flags for
> >> this?
> > 
> > ->splice_flags shares memory with rwflags, so can't be used.
> > 
> > I think it is fine to use ->addr3, given io_getxattr()/io_setxattr()/
> > io_msg_ring() has used that.
> 
> This is part of the confusion, as you treat it basically like a
> read/write internally, but the opcode names indicate differently. Why
> not just have a separate prep helper for these and then use a layout

Looks separate prep is cleaner.

> that makes more sense,surely rwflags aren't applicable for these
> anyway? I think that'd make it a lot cleaner.
> 
> Yeah, addr3 could easily be used, but it's makes for a really confusing
> command structure when the command is kinda-read but also kinda-splice.
> And it arguable makes more sense to treat it as the latter, as it takes
> the two fds like splice.

Yeah, it can be thought as one direct splice between device and generic
file, and I'd take more words to share the whole story here.

1) traditional splice is: 

file A(read file A to pipe buffer) -> pipe -> file B(write pipe buffer to file B)

which implements zero copy for 'cp file_A file_B'.

2) device splice (device -> pipe -> file)

If only for doing 'cp device file', the current splice works just fine, however
we do not have syscall for handling the following more generic scenario:

	dev(produce buffer to pipe) -> pipe -> file(consume buffer from pipe)

splice is supposed for transferring pages over pipe, so the above model
is reasonable. And we do have kernel APIs for doing the above by direct
pipe: splice_direct_to_actor & __splice_from_pipe.

That is basically what this patch does, and consumer could be READ
or WRITE. The current splice syscall only supports WRITE consumer(
write pipe buffer to file) in pipe's read end.

It can be used for fuse to support READ zero copy(fuse write zero copy
was supported in 2010, but never support read zero copy because of missing
such syscall), and for supporting ublk zero copy.

Actually it can help to implement any "virt" drivers, most of which just
talks with file or socket, or anything which can be handled in userspace
efficiently.

Follows the usage, suppose the syscall is named as dev_splice()

	1) "poll" device A for incoming request
	- "poll' just represents one kernel/user communication, once it
	returns, there is request peeked

    - device is one virt device and exposed to userspace and for
	receiving request from userspace

	2) handling driver specific logic
	   - it could be request mapping, such as logical volume manager,
		 the logic can be quite complicated to require Btree to map
		 request from A to underlying files, such as dm-thin or bcache

	   - it could be network based device, nbd, ceph RBD, drbd, ..., usb
	   over IP, .., there could be meta lookup over socket, send
	   command/recv reply, crypto enc/dec, ...

	3) dev_splice(A, A_offset, len, B, B_offset, OP)
	- A is the device
	- B is one generic file(regular file, block device, socket, ...)
	- OP is the consumer operation(could just be read/write or net
	  recv/send)
	- A(produce buffer) -> direct pipe -> B(consume the buffer from
	pipe by OP)

	4) after the above device_splice() is completed, request is
	completed, and send notification to userspace

Now we have io_uring command for handling 1) & 4) efficiently, the
test over ublk has shown this point. If 3) can be supported, the
complicated driver/device logic in 2) can be moved to userspace.
Then the whole driver can be implemented in userspace efficiently,
performance could even be better than kernel driver[1][2].

The trouble is that when dev_splice() becomes much more generic, it is
harder to define it as syscall. It could be easier with io_uring
compared with splice() syscall, but still not easy enough:

- if the sqe interface(for providing parameters) can be stable from
  beginning, or can be extended in future easily

- REQ_F_* has been used up

- may cause some trouble inside io_uring implementation, such as, how
to convert to other io_uring OP handling with the provide consumer op code.

That is why I treat it as io_uring read/write from the beginning, the other
two could be just treated as net recv/send, and only difference is that
buffer is retrieved from direct pipe via splice(device, offset, len).

So which type is better?  Treating it as dev_spice() or specific consumer
OP? If we treat it as splice, is it better to define it as one single
generic OP, or muliple OPs and each one is mapped to single consumer OP?

I am open about the interface definition, any comment/suggestion is
highly welcome!

[1] https://lore.kernel.org/linux-block/Yza1u1KfKa7ycQm0@T590/
[2] https://lore.kernel.org/lkml/20230126040822.GA2858@xxxxxx/T/

Thanks,
Ming