Re: [PATCH V8 0/8] io_uring: support sqe group and leased group kbuf

Ming Lei <ming.lei@xxxxxxxxxx> · Thu, 31 Oct 2024 10:53:25 +0800

On Wed, Oct 30, 2024 at 07:20:48AM -0600, Jens Axboe wrote:
> On 10/29/24 10:11 PM, Ming Lei wrote:
> > On Wed, Oct 30, 2024 at 11:08:16AM +0800, Ming Lei wrote:
> >> On Tue, Oct 29, 2024 at 08:43:39PM -0600, Jens Axboe wrote:
> > 
> > ...
> > 
> >>> You could avoid the OP dependency with just a flag, if you really wanted
> >>> to. But I'm not sure it makes a lot of sense. And it's a hell of a lot
> >>
> >> Yes, IO_LINK won't work for submitting multiple IOs concurrently, extra
> >> syscall makes application too complicated, and IO latency is increased.
> >>
> >>> simpler than the sqe group scheme, which I'm a bit worried about as it's
> >>> a bit complicated in how deep it needs to go in the code. This one
> >>> stands alone, so I'd strongly encourage we pursue this a bit further and
> >>> iron out the kinks. Maybe it won't work in the end, I don't know, but it
> >>> seems pretty promising and it's soooo much simpler.
> >>
> >> If buffer register and lookup are always done in ->prep(), OP dependency
> >> may be avoided.
> > 
> > Even all buffer register and lookup are done in ->prep(), OP dependency
> > still can't be avoided completely, such as:
> > 
> > 1) two local buffers for sending to two sockets
> > 
> > 2) group 1: IORING_OP_LOCAL_KBUF1 & [send(sock1), send(sock2)]  
> > 
> > 3) group 2: IORING_OP_LOCAL_KBUF2 & [send(sock1), send(sock2)]
> > 
> > group 1 and group 2 needs to be linked, but inside each group, the two
> > sends may be submitted in parallel.
> 
> That is where groups of course work, in that you can submit 2 groups and
> have each member inside each group run independently. But I do think we
> need to decouple the local buffer and group concepts entirely. For the
> first step, getting local buffers working with zero copy would be ideal,
> and then just live with the fact that group 1 needs to be submitted
> first and group 2 once the first ones are done.

IMHO, it is one _kernel_ zero copy(_performance_) feature, which often imply:

- better performance expectation
- no big change on existed application for using this feature

Application developer is less interested in sort of crippled or immature
feature, especially need big change on existed code logic(then two code paths
need to maintain), with potential performance regression.

With sqe group and REQ_F_GROUP_KBUF, application just needs lines of
code change for using the feature, and it is pretty easy to evaluate
the feature since no any extra logic change & no extra syscall/wait
introduced. The whole patchset has been mature enough, unfortunately
blocked without obvious reasons.

> 
> Once local buffers are done, we can look at doing the sqe grouping in a
> nice way. I do think it's a potentially powerful concept, but we're
> going to make a lot more progress on this issue if we carefully separate
> dependencies and get each of them done separately.

One fundamental difference between local buffer and REQ_F_GROUP_KBUF is

- local buffer has to be provided and used in ->prep()
- REQ_F_GROUP_KBUF needs to be provided in ->issue() instead of ->prep()

The only common code could be one buffer abstraction for OP to use, but
still used differently, ->prep() vs. ->issue().

So it is hard to call it decouple, especially REQ_F_GROUP_KBUF has been
simple enough, and the main change is to import it in OP code.

Local buffer is one smart idea, but I hope the following things may be
settled first:

1) is it generic enough to just allow to provide local buffer during
->prep()?

- this way actually becomes sync & nowait IO, instead AIO, and has been
  one strong constraint from UAPI viewpoint.

- Driver may need to wait until some data comes, then return & provide
the buffer with data, and local buffer can't cover this case

2) is it allowed to call ->uring_cmd() from io_uring_cmd_prep()? If not,
any idea to call into driver for leasing the kernel buffer to io_uring?

3) in OP code, how to differentiate normal userspace buffer select with
local buffer? And how does OP know if normal buffer select or local
kernel buffer should be used? Some OP may want to use normal buffer
select instead of local buffer, others may want to use local buffer.

4) arbitrary numbers of local buffer needs to be supported, since IO
often comes at batch, it shouldn't be hard to support it by adding xarray
to submission state, what do you think of this added complexity? Without
supporting arbitrary number of local buffers, performance can be just
bad, it doesn't make sense as zc viewpoint. Meantime as number of local
buffer is increased, more rsrc_node & imu allocation is introduced, this
still may degrade perf a bit.

5) io_rsrc_node becomes part of interface between io_uring and driver
for releasing the leased buffer, so extra data has to be
added to `io_rsrc_node` for driver use.

However, from above, the following can be concluded at least:

- it isn't generic enough, #1, #3
- it still need sqe group
- it is much more complicated than REQ_F_GROUP_KBUF only
- it can't be more efficient

Thanks,
Ming