On Thu, Oct 31, 2024 at 07:35:35AM -0600, Jens Axboe wrote:
> On 10/30/24 8:53 PM, Ming Lei wrote:
> > On Wed, Oct 30, 2024 at 07:20:48AM -0600, Jens Axboe wrote:
> >> On 10/29/24 10:11 PM, Ming Lei wrote:
> >>> On Wed, Oct 30, 2024 at 11:08:16AM +0800, Ming Lei wrote:
> >>>> On Tue, Oct 29, 2024 at 08:43:39PM -0600, Jens Axboe wrote:
> >>>
> >>> ...
> >>>
> >>>>> You could avoid the OP dependency with just a flag, if you really wanted
> >>>>> to. But I'm not sure it makes a lot of sense. And it's a hell of a lot
> >>>>
> >>>> Yes, IO_LINK won't work for submitting multiple IOs concurrently, an
> >>>> extra syscall makes the application too complicated, and IO latency is
> >>>> increased.
> >>>>
> >>>>> simpler than the sqe group scheme, which I'm a bit worried about as it's
> >>>>> a bit complicated in how deep it needs to go in the code. This one
> >>>>> stands alone, so I'd strongly encourage we pursue this a bit further and
> >>>>> iron out the kinks. Maybe it won't work in the end, I don't know, but it
> >>>>> seems pretty promising and it's soooo much simpler.
> >>>>
> >>>> If buffer register and lookup are always done in ->prep(), the OP
> >>>> dependency may be avoided.
> >>>
> >>> Even if all buffer register and lookup are done in ->prep(), the OP
> >>> dependency still can't be avoided completely, such as:
> >>>
> >>> 1) two local buffers for sending to two sockets
> >>>
> >>> 2) group 1: IORING_OP_LOCAL_KBUF1 & [send(sock1), send(sock2)]
> >>>
> >>> 3) group 2: IORING_OP_LOCAL_KBUF2 & [send(sock1), send(sock2)]
> >>>
> >>> group 1 and group 2 need to be linked, but inside each group, the two
> >>> sends may be submitted in parallel.
> >>
> >> That is where groups of course work, in that you can submit 2 groups and
> >> have each member inside each group run independently. But I do think we
> >> need to decouple the local buffer and group concepts entirely. For the
> >> first step, getting local buffers working with zero copy would be ideal,
> >> and then just live with the fact that group 1 needs to be submitted
> >> first and group 2 once the first ones are done.
> >
> > IMHO, it is a _kernel_ zero copy (_performance_) feature, which often
> > implies:
> >
> > - better performance expectation
> > - no big change to existing applications for using this feature
>
> For #2, really depends on what it is. But ideally, yes, agree.
>
> > Application developers are less interested in a crippled or
> > immature feature, especially one that needs big changes to existing code
> > logic (so two code paths need to be maintained), with potential
> > performance regression.
> >
> > With sqe group and REQ_F_GROUP_KBUF, an application just needs a few
> > lines of code change to use the feature, and it is pretty easy to
> > evaluate the feature since no extra logic change and no extra
> > syscall/wait are introduced. The whole patchset has been mature enough,
> > but is unfortunately blocked without obvious reasons.
>
> Let me tell you where I'm coming from. If you might recall, I originated
> the whole grouping idea. Didn't complete it, but it's essentially the
> same concept as REQ_F_GROUP_KBUF in that you have some dependents on a
> leader, and the dependents can run in parallel rather than being
> serialized by links. I'm obviously in favor of this concept, but I want
> to see it being done in such a way that it's actually something we can
> reason about and maintain. You want it for zero copy, which makes sense,
> but I also want to ensure it's a CLEAN implementation that doesn't have
> tangles in places it doesn't need to.
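
To make the two-group example quoted above concrete, here is a minimal
sketch of the submission pattern in liburing style. IOSQE_GROUP_BIT and
io_uring_prep_local_kbuf() are hypothetical placeholders for the
grouping/buffer-providing uAPI being discussed (not an existing
interface), and setting IOSQE_IO_LINK on the leader to order group 1
before group 2 follows the patchset's intended semantics:

	#include <liburing.h>
	#include <stdbool.h>

	/* Sketch only: one group = a leader that leases a kernel buffer
	 * plus two sends that consume it and may run in parallel. */
	static void submit_group(struct io_uring *ring, unsigned buf_idx,
				 int sock1, int sock2, unsigned len,
				 bool link_next_group)
	{
		struct io_uring_sqe *sqe;

		/* group leader: provide the kernel buffer for this group */
		sqe = io_uring_get_sqe(ring);
		io_uring_prep_local_kbuf(sqe, buf_idx);	/* hypothetical helper */
		sqe->flags |= IOSQE_GROUP_BIT;		/* hypothetical flag */
		if (link_next_group)
			sqe->flags |= IOSQE_IO_LINK;	/* group 1 -> group 2 ordering */

		/* members: part of the group, not serialized against each
		 * other; addr is NULL because the data comes from the
		 * leased kernel buffer selected via buf_index (assumed) */
		sqe = io_uring_get_sqe(ring);
		io_uring_prep_send(sqe, sock1, NULL, len, 0);
		sqe->buf_index = buf_idx;

		sqe = io_uring_get_sqe(ring);
		io_uring_prep_send(sqe, sock2, NULL, len, 0);
		sqe->buf_index = buf_idx;
	}
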
>
> You seem to be very hard to convince of making ANY changes at all. In
> your mind the whole thing is done, and it's being "blocked without
> obvious reason". It's not being blocked at all, I've been diligently
> trying to work with you recently on getting this done. I'm at least as
> interested as you in getting this work done. But I want you to work with
> me a bit on some items so we can get it into a shape where I'm happy
> with it, and I can maintain it going forward.
>
> So, please, rather than dig your heels in all the time, have an open
> mind on how we can accomplish some of these things.
>
> >> Once local buffers are done, we can look at doing the sqe grouping in a
> >> nice way. I do think it's a potentially powerful concept, but we're
> >> going to make a lot more progress on this issue if we carefully separate
> >> dependencies and get each of them done separately.
> >
> > One fundamental difference between local buffer and REQ_F_GROUP_KBUF is
> >
> > - local buffer has to be provided and used in ->prep()
> > - REQ_F_GROUP_KBUF needs to be provided in ->issue() instead of ->prep()
>
> It does not - the POC certainly did it in ->prep(), but all it really
> cares about is having the ring locked. ->prep() always has that,
> ->issue() _normally_ has that, unless it ends up in an io-wq punt.
>
> You can certainly do it in ->issue() and still have it be per-submit,
> the latter of which I care about for safety reasons. This just means it
> has to be provided in the _first_ issue, and that IOSQE_ASYNC must not
> be set on the request. I think that restriction is fine, nobody should
> really be using IOSQE_ASYNC anyway.

Yes, IOSQE_ASYNC can't work, and IO_LINK can't work either. The
restriction on IO_LINK is too strong, because lifting it would need big
application changes.

>
> I think the original POC maybe did more harm than good in that it was
> too simplistic, and you seem too focused on the limits of that. So let

I am actually trying to think through how to code local buffer, which is
why I put these questions first.

> me detail what it actually could look like. We have io_submit_state in
> io_ring_ctx. This is per-submit private data, it's initialized and
> flushed for each io_uring_enter(2) that submits requests.
>
> We have a registered file and buffer table, file_table and buf_table.
> These have lifetimes that are dependent on the ring and
> registration/unregistration. We could have a local_table. This one
> should be set up by some register command, e.g. reserving X slots for
> that. At the end of submit, we'd flush this table, putting nodes in
> there. Requests can look at the table in either prep or issue, and find
> buffer nodes. If a request uses one of these, it grabs a ref and hence
> has it available until it puts it at IO completion time. When a single
> submit context is done, local_table is iterated (if any entries exist)
> and existing nodes cleared and put.
>
> That provides a similar table to buf_table, but with a lifetime of a
> submit. Just like local buf. Yes, it would not be private to a single
> group, it'd be private to a submit which has potentially bigger scope,
> but that should not matter.
>
> That should give you exactly what you need, if you use
> IORING_RSRC_KBUFFER in the local_table. But it could even be used for
> IORING_RSRC_BUFFER as well, providing buffers for a single submit cycle
> as well.
>
> Rather than do something completely on the side with
> io_uring_kernel_buf, we can use io_rsrc_node and io_mapped_ubuf for
> this.
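
If I read the local_table idea right, the end-of-submit flush could be
as simple as the sketch below; local_table as an io_rsrc_data hanging
off io_ring_ctx is my assumption, and io_put_rsrc_node() is the existing
node put from your table rework:

	/* Sketch: drop the table's reference on every node installed
	 * during this submit cycle. An op that grabbed its own ref in
	 * prep/issue keeps the buffer alive until its completion-time
	 * put, so this only ends the per-submit lifetime. */
	static void io_local_table_flush(struct io_ring_ctx *ctx)
	{
		struct io_rsrc_data *tbl = &ctx->local_table;	/* assumed member */
		int i;

		for (i = 0; i < tbl->nr; i++) {
			struct io_rsrc_node *node = tbl->nodes[i];

			if (!node)
				continue;
			tbl->nodes[i] = NULL;
			io_put_rsrc_node(ctx, node);
		}
	}

This would be called from wherever the submit state is flushed, guarded
by the cheap `if (ctx->local_table.nr)` check you mention below.
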
> Which goes back to my initial rant in this email - use EXISTING
> infrastructure for these things. A big part of why this isn't making
> progress is that a lot of things are done on the side rather than being
> integrated. Then you need extra io_kiocb members, where it really should
> just be using io_rsrc_node and get everything else for free. No need to
> do special checking and putting separately, it's a resource node just
> like any other resource node we already support.
>
> > The only common code could be one buffer abstraction for OPs to use,
> > but it would still be used differently, ->prep() vs. ->issue().
>
> With the prep vs issue thing not being an issue, then it sounds like we

IO_LINK is another blocker for the prep vs issue thing.

> fully agree that a) it should be one buffer abstraction, and b) we

Yes, can't agree more.

> already have the infrastructure for this. We just need to add
> IORING_RSRC_KBUFFER, which I already posted some POC code for.

I am open to IORING_RSRC_KBUFFER if there isn't too strong a limit for
applications. Disallowing IOSQE_ASYNC is fine, but not allowing IO_LINK
does cause trouble for applications, and needs big changes to app code.

> > So it is hard to call it decoupling, especially since REQ_F_GROUP_KBUF
> > has been simple enough, and the main change is to import it in OP code.
> >
> > Local buffer is one smart idea, but I hope the following things can be
> > settled first:
> >
> > 1) is it generic enough to just allow providing local buffers during
> > ->prep()?
> >
> > - this way actually becomes sync & nowait IO, instead of AIO, and that
> > has been one strong constraint from the UAPI viewpoint.
> >
> > - a driver may need to wait until some data comes, then return &
> > provide the buffer with data, and local buffer can't cover this case
>
> This should be moot now with the above explanation.
>
> > 2) is it allowed to call ->uring_cmd() from io_uring_cmd_prep()? If
> > not, any idea how to call into the driver for leasing the kernel
> > buffer to io_uring?
>
> Ditto
>
> > 3) in OP code, how to differentiate normal userspace buffer select
> > from local buffer? And how does an OP know if normal buffer select or
> > local kernel buffer should be used? Some OPs may want to use normal
> > buffer select instead of local buffer, others may want to use local
> > buffer.
>
> Yes this is a key question we need to figure out. Right now using fixed
> buffers needs to set ->buf_index, and the OP needs to know about it.
> Let's not confuse it with buffer select, IOSQE_BUFFER_SELECT, as that's
> for provided buffers.

Indeed, what matters here is fixed buffers.

> > 4) arbitrary numbers of local buffers need to be supported, since IO
> > often comes in batches. It shouldn't be hard to support by adding an
> > xarray to the submission state, but what do you think of that added
> > complexity? Without supporting an arbitrary number of local buffers,
> > performance can just be bad; it doesn't make sense from a zc
> > viewpoint. Meantime, as the number of local buffers increases, more
> > rsrc_node & imu allocations are introduced, and this may still degrade
> > perf a bit.
>
> That's fine, we just need to reserve space for them upfront. I don't
> like the xarray idea, as:
>
> 1) xarray does internal locking, which we don't need here
> 2) The existing io_rsrc_data table is what is being used for
> io_rsrc_node management now. This would introduce another method for
> that.
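
On 3) above: if the per-submit table reuses io_rsrc_data, then resolving
->buf_index could just try the local table before the registered buffer
table. A sketch (io_find_buf_node() is made up, io_rsrc_node_lookup() is
from your table rework, local_table is the assumed per-submit table):

	/* Sketch: fixed-buffer lookup that prefers a per-submit local
	 * buffer over a long-lived registered one, grabbing a reference
	 * that the op drops at completion. */
	static struct io_rsrc_node *io_find_buf_node(struct io_kiocb *req,
						     unsigned buf_index)
	{
		struct io_ring_ctx *ctx = req->ctx;
		struct io_rsrc_node *node;

		node = io_rsrc_node_lookup(&ctx->local_table, buf_index);
		if (!node)
			node = io_rsrc_node_lookup(&ctx->buf_table, buf_index);
		if (node)
			node->refs++;	/* put at IO completion time */
		return node;
	}

This would keep OPs oblivious to whether the buffer is a kernel-leased
one or a normal registered buffer; only the node type differs.
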
>
> I do want to ensure that io_submit_state_finish() is still low overhead,
> and using an xarray would be more expensive than just doing:
>
> 	if (ctx->local_table.nr)
> 		flush_nodes();
>
> as you'd need to always set up an iterator. But this isn't really THAT
> important. The benefit of using an xarray would be that we'd get
> flexible storing of members without needing pre-registration, obviously.

The main trouble could be buffer table pre-allocation if an xarray isn't
used.

> > 5) io_rsrc_node becomes part of the interface between io_uring and the
> > driver for releasing the leased buffer, so extra data has to be
> > added to `io_rsrc_node` for driver use.
>
> That's fine imho. The KBUFFER addition already adds the callback, we can
> add a data thing too. The kernel you based your code on has an
> io_rsrc_node that is 48 bytes in size, and my current tree has one where
> it's 32 bytes in size after the table rework. If we have to add 2x8b to
> support this, that's NOT a big deal and we just end up with a node
> that's the same size as before.
>
> And we get rid of this odd intermediate io_uring_kernel_buf struct,
> which is WAY bigger anyway, and requires TWO allocations where the
> existing io_mapped_ubuf embeds the bvec. I'd argue two vs one allocs is
> a much bigger deal for performance reasons.

The structure is actually preallocated in ublk, so it is zero allocation
in my patchset in the case where bios aren't merged.

>
> As a final note, one thing I mentioned in an earlier email is that part
> of the issue here is that there are several things that need ironing
> out, and they are actually totally separate. One is the buffer side,
> which this email mostly deals with, the other one is the grouping
> concept.
>
> For the sqe grouping, one sticking point has been using that last
> sqe->flags bit. I was thinking about this last night, and what if we got
> away from using a flag entirely? At some point io_uring needs to deal
> with this flag limitation, but it's arguably a large effort, and I'd
> greatly prefer not having to paper over it to shove in grouped SQEs.
>
> So... what if we simply added a new OP, IORING_OP_GROUP_START, or
> something like that. Hence instead of having a designated group leader
> bit for an OP, e.g.:
>
> 	sqe = io_uring_get_sqe(ring);
> 	io_uring_prep_read(sqe, ...);
> 	sqe->flags |= IOSQE_GROUP_BIT;
>
> you'd do:
>
> 	sqe = io_uring_get_sqe(ring);
> 	io_uring_prep_group_start(sqe, ...);
> 	sqe->flags |= IOSQE_IO_LINK;

One problem is how to map IORING_OP_GROUP_START to uring_cmd.

IOSQE_IO_LINK is another trouble spot, because it then becomes
impossible to model dependencies among groups.

>
> 	sqe = io_uring_get_sqe(ring);
> 	io_uring_prep_read(sqe, ...);
>
> which would be the equivalent transformation - the read would be the
> group leader as it's the first member of that chain. The read should set
> IOSQE_IO_LINK for as long as it has members. The members in that group
> would NOT be serialized. They would use IOSQE_IO_LINK purely to be part
> of that group, but IOSQE_IO_LINK would not cause them to be serialized.
> Hence the link just implies membership, not ordering within the group.
>
> This removes the flag issue, with the slight caveat that IOSQE_IO_LINK
> has a different meaning inside the group. Maybe you'd need a GROUP_END

No, for the group leader IOSQE_IO_LINK works the same as for any other
normal SQE.

Thanks,
Ming