On 10/30/24 8:53 PM, Ming Lei wrote:
> On Wed, Oct 30, 2024 at 07:20:48AM -0600, Jens Axboe wrote:
>> On 10/29/24 10:11 PM, Ming Lei wrote:
>>> On Wed, Oct 30, 2024 at 11:08:16AM +0800, Ming Lei wrote:
>>>> On Tue, Oct 29, 2024 at 08:43:39PM -0600, Jens Axboe wrote:
>>>
>>> ...
>>>
>>>>> You could avoid the OP dependency with just a flag, if you really
>>>>> wanted to. But I'm not sure it makes a lot of sense. And it's a
>>>>> hell of a lot
>>>>
>>>> Yes, IO_LINK won't work for submitting multiple IOs concurrently,
>>>> an extra syscall makes the application too complicated, and IO
>>>> latency is increased.
>>>>
>>>>> simpler than the sqe group scheme, which I'm a bit worried about
>>>>> as it's a bit complicated in how deep it needs to go in the code.
>>>>> This one stands alone, so I'd strongly encourage we pursue this a
>>>>> bit further and iron out the kinks. Maybe it won't work in the
>>>>> end, I don't know, but it seems pretty promising and it's soooo
>>>>> much simpler.
>>>>
>>>> If buffer register and lookup are always done in ->prep(), the OP
>>>> dependency may be avoided.
>>>
>>> Even if all buffer register and lookup are done in ->prep(), the OP
>>> dependency still can't be avoided completely, such as:
>>>
>>> 1) two local buffers for sending to two sockets
>>>
>>> 2) group 1: IORING_OP_LOCAL_KBUF1 & [send(sock1), send(sock2)]
>>>
>>> 3) group 2: IORING_OP_LOCAL_KBUF2 & [send(sock1), send(sock2)]
>>>
>>> group 1 and group 2 need to be linked, but inside each group, the
>>> two sends may be submitted in parallel.
>>
>> That is where groups of course work, in that you can submit 2 groups
>> and have each member inside each group run independently. But I do
>> think we need to decouple the local buffer and group concepts
>> entirely. For the first step, getting local buffers working with zero
>> copy would be ideal, and then just live with the fact that group 1
>> needs to be submitted first and group 2 once the first ones are done.
>
> IMHO, it is a _kernel_ zero copy (_performance_) feature, which often
> implies:
>
> - better performance expectation
> - no big change to existing applications for using this feature

For #2, really depends on what it is. But ideally, yes, agree.

> Application developers are less interested in a sort of crippled or
> immature feature, especially one that needs big changes to existing
> code logic (then two code paths need to be maintained), with potential
> performance regression.
>
> With sqe group and REQ_F_GROUP_KBUF, an application just needs a few
> lines of code change to use the feature, and it is pretty easy to
> evaluate the feature since no extra logic change & no extra
> syscall/wait is introduced. The whole patchset has been mature enough,
> but unfortunately blocked without obvious reasons.

Let me tell you where I'm coming from. As you might recall, I
originated the whole grouping idea. I didn't complete it, but it's
essentially the same concept as REQ_F_GROUP_KBUF in that you have some
dependents on a leader, and the dependents can run in parallel rather
than being serialized by links. I'm obviously in favor of this concept,
but I want to see it being done in such a way that it's actually
something we can reason about and maintain. You want it for zero copy,
which makes sense, but I also want to ensure it's a CLEAN
implementation that doesn't have tangles in places it doesn't need to.
You seem to be very hard to convince of making ANY changes at all. In
your mind the whole thing is done, and it's being "blocked without
obvious reason".
It's not being blocked at all, I've been diligently trying to work with
you recently on getting this done. I'm at least as interested as you in
getting this work done. But I want you to work with me a bit on some
items so we can get it into a shape where I'm happy with it, and where
I can maintain it going forward. So, please, rather than dig your heels
in all the time, have an open mind on how we can accomplish some of
these things.

>> Once local buffers are done, we can look at doing the sqe grouping in
>> a nice way. I do think it's a potentially powerful concept, but we're
>> going to make a lot more progress on this issue if we carefully
>> separate dependencies and get each of them done separately.
>
> One fundamental difference between local buffer and REQ_F_GROUP_KBUF is
>
> - local buffer has to be provided and used in ->prep()
> - REQ_F_GROUP_KBUF needs to be provided in ->issue() instead of ->prep()

It does not - the POC certainly did it in ->prep(), but all it really
cares about is having the ring locked. ->prep() always has that,
->issue() _normally_ has that, unless it ends up in an io-wq punt. You
can certainly do it in ->issue() and still have it be per-submit, the
latter of which I care about for safety reasons. This just means it has
to be provided in the _first_ issue, and that IOSQE_ASYNC must not be
set on the request. I think that restriction is fine, nobody should
really be using IOSQE_ASYNC anyway.

I think the original POC maybe did more harm than good in that it was
too simplistic, and you seem too focused on its limits. So let me
detail what it actually could look like.

We have io_submit_state in io_ring_ctx. This is per-submit private
data, initialized and flushed for each io_uring_enter(2) that submits
requests.

We have a registered file and buffer table, file_table and buf_table.
These have lifetimes tied to the ring and to
registration/unregistration.

We could have a local_table. This one would be set up by some register
command, eg reserving X slots for it. At the end of submit, we'd flush
this table, putting the nodes in there. Requests can look at the table
in either prep or issue, and find buffer nodes. If a request uses one
of these, it grabs a ref and hence has it available until it puts it at
IO completion time. When a single submit context is done, local_table
is iterated (if any entries exist) and existing nodes are cleared and
put.

That provides a table similar to buf_table, but with the lifetime of a
submit. Just like local buf. Yes, it would not be private to a single
group, it'd be private to a submit, which has a potentially bigger
scope, but that should not matter.

That should give you exactly what you need, if you use
IORING_RSRC_KBUFFER in the local_table. But it could even be used for
IORING_RSRC_BUFFER as well, providing buffers for a single submit
cycle.

Rather than do something completely on the side with
io_uring_kernel_buf, we can use io_rsrc_node and io_mapped_ubuf for
this. Which goes back to my initial rant in this email - use EXISTING
infrastructure for these things. A big part of why this isn't making
progress is that a lot of things are done on the side rather than being
integrated. Then you need extra io_kiocb members, where it really
should just be using io_rsrc_node and get everything else for free. No
need to do special checking and putting separately, it's a resource
node just like any other resource node we already support.
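Roughly, and purely as a sketch - the struct layout and helper names
below are placeholders to illustrate the idea, not existing io_uring
code:

	/*
	 * Hypothetical submit-scoped table hanging off io_ring_ctx,
	 * sized when the reservation register command runs.
	 */
	struct io_local_table {
		struct io_rsrc_node	**nodes;
		unsigned int		nr;
	};

	/*
	 * Called when the submit context ends: drop the submit's
	 * reference to any node installed in the table. A request that
	 * grabbed a node in prep/issue holds its own ref, so its buffer
	 * survives until that request completes even though the table
	 * is cleared here.
	 */
	static void io_local_table_flush(struct io_ring_ctx *ctx)
	{
		struct io_local_table *lt = &ctx->local_table;
		unsigned int i;

		for (i = 0; i < lt->nr; i++) {
			if (!lt->nodes[i])
				continue;
			io_put_rsrc_node(ctx, lt->nodes[i]);
			lt->nodes[i] = NULL;
		}
	}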
> The only common code could be one buffer abstraction for OPs to use,
> but it would still be used differently, ->prep() vs. ->issue().

With the prep vs issue thing not being an issue, then it sounds like we
fully agree that a) it should be one buffer abstraction, and b) we
already have the infrastructure for this. We just need to add
IORING_RSRC_KBUFFER, which I already posted some POC code for.

> So it is hard to call it decoupling, especially since REQ_F_GROUP_KBUF
> has been simple enough, and the main change is to import it in OP
> code.
>
> Local buffer is a smart idea, but I hope the following things can be
> settled first:
>
> 1) is it generic enough to just allow providing a local buffer during
> ->prep()?
>
> - this way it actually becomes sync & nowait IO instead of AIO, which
> is a strong constraint from the UAPI viewpoint.
>
> - the driver may need to wait until some data comes, then return &
> provide the buffer with data; a local buffer can't cover this case

This should be moot now with the above explanation.

> 2) is it allowed to call ->uring_cmd() from io_uring_cmd_prep()? If
> not, any idea how to call into the driver for leasing the kernel
> buffer to io_uring?

Ditto

> 3) in OP code, how to differentiate normal userspace buffer select
> from local buffer? And how does the OP know if normal buffer select or
> a local kernel buffer should be used? Some OPs may want to use normal
> buffer select instead of a local buffer, others may want to use the
> local buffer.

Yes, this is a key question we need to figure out. Right now using
fixed buffers needs to set ->buf_index, and the OP needs to know about
it. Let's not confuse it with buffer select, IOSQE_BUFFER_SELECT, as
that's for provided buffers.

> 4) arbitrary numbers of local buffers need to be supported, since IO
> often comes in batches; it shouldn't be hard to support by adding an
> xarray to the submission state, but what do you think of this added
> complexity? Without supporting an arbitrary number of local buffers,
> performance can just be bad, which doesn't make sense from a zc
> viewpoint. Meantime, as the number of local buffers is increased, more
> rsrc_node & imu allocations are introduced, which may still degrade
> perf a bit.

That's fine, we just need to reserve space for them upfront. I don't
like the xarray idea, as:

1) xarray does internal locking, which we don't need here

2) The existing io_rsrc_data table is what is being used for
io_rsrc_node management now. This would introduce another method for
that.

I do want to ensure that io_submit_state_finish() is still low
overhead, and using an xarray would be more expensive than just doing:

	if (ctx->local_table.nr)
		flush_nodes();

as you'd need to always set up an iterator. But this isn't really THAT
important. The benefit of using an xarray would be that we'd get
flexible storing of members without needing pre-registration,
obviously.

> 5) io_rsrc_node becomes part of the interface between io_uring and the
> driver for releasing the leased buffer, so extra data has to be added
> to `io_rsrc_node` for driver use.

That's fine imho. The KBUFFER addition already adds the callback, we
can add a data thing too. The kernel you based your code on has an
io_rsrc_node that is 48 bytes in size, and my current tree has one
where it's 32 bytes in size after the table rework. If we have to add
2x8b to support this, that's NOT a big deal and we just end up with a
node that's the same size as before.
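To make that concrete - and this is only a sketch, the existing fields
are approximate and the two added ones are placeholder names for
illustration:

	struct io_rsrc_node {
		unsigned char		type;	/* FILE / BUFFER / KBUFFER */
		int			refs;
		u64			tag;
		union {
			unsigned long		file_ptr;
			struct io_mapped_ubuf	*buf;
		};
		/* the extra 2x8b for a leased kernel buffer */
		void (*kbuf_release)(void *priv);	/* callback into the driver */
		void			*kbuf_priv;	/* driver cookie handed back */
	};

The final put of a KBUFFER node would then just call
->kbuf_release(->kbuf_priv), handing the buffer back to the driver,
with no side struct needed.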
And we get rid of this odd intermediate io_uring_kernel_buf struct,
which is WAY bigger anyway, and requires TWO allocations where the
existing io_mapped_ubuf embeds the bvec. I'd argue two vs one allocs is
a much bigger deal for performance reasons.

As a final note, one thing I mentioned in an earlier email is that part
of the issue here is that there are several things that need ironing
out, and they are actually totally separate. One is the buffer side,
which this email mostly deals with; the other one is the grouping
concept.

For the sqe grouping, one sticking point has been using that last
sqe->flags bit. I was thinking about this last night, and what if we
got away from using a flag entirely? At some point io_uring needs to
deal with this flag limitation, but it's arguably a large effort, and
I'd greatly prefer not having to paper over it to shove in grouped
SQEs. So... what if we simply added a new OP, IORING_OP_GROUP_START, or
something like that. Hence instead of having a designated group leader
bit for an OP, eg:

	sqe = io_uring_get_sqe(ring);
	io_uring_prep_read(sqe, ...);
	sqe->flags |= IOSQE_GROUP_BIT;

you'd do:

	sqe = io_uring_get_sqe(ring);
	io_uring_prep_group_start(sqe, ...);
	sqe->flags |= IOSQE_IO_LINK;

	sqe = io_uring_get_sqe(ring);
	io_uring_prep_read(sqe, ...);

which would be the equivalent transformation - the read would be the
group leader, as it's the first member of that chain. The read should
set IOSQE_IO_LINK for as long as it has members. The members in that
group would NOT be serialized. They would use IOSQE_IO_LINK purely to
be part of that group, but IOSQE_IO_LINK would not cause them to be
serialized. Hence the link just implies membership, not ordering within
the group.

This removes the flag issue, with the slight caveat that IOSQE_IO_LINK
has a different meaning inside the group. Maybe you'd need a GROUP_END
op as well, so you could potentially terminate the group. Or maybe
you'd just rely on the usual semantics, where the first one that
doesn't have IOSQE_IO_LINK set marks the end of the group, which is how
linked chains work right now too.

The whole grouping wouldn't change at all, it's just a different way of
marking what constitutes a group that doesn't run afoul of the whole
flag limitation thing.

Just an idea!

-- 
Jens Axboe