On Thu, Oct 31, 2024 at 07:35:35AM -0600, Jens Axboe wrote:
> On 10/30/24 8:53 PM, Ming Lei wrote:
> > On Wed, Oct 30, 2024 at 07:20:48AM -0600, Jens Axboe wrote:
> >> On 10/29/24 10:11 PM, Ming Lei wrote:
> >>> On Wed, Oct 30, 2024 at 11:08:16AM +0800, Ming Lei wrote:
> >>>> On Tue, Oct 29, 2024 at 08:43:39PM -0600, Jens Axboe wrote:
> >>>
> >>> ...
> >>>
> >>>>> You could avoid the OP dependency with just a flag, if you really wanted
> >>>>> to. But I'm not sure it makes a lot of sense. And it's a hell of a lot
> >>>>
> >>>> Yes, IO_LINK won't work for submitting multiple IOs concurrently, an
> >>>> extra syscall makes the application too complicated, and IO latency is
> >>>> increased.
> >>>>
> >>>>> simpler than the sqe group scheme, which I'm a bit worried about as it's
> >>>>> a bit complicated in how deep it needs to go in the code. This one
> >>>>> stands alone, so I'd strongly encourage we pursue this a bit further and
> >>>>> iron out the kinks. Maybe it won't work in the end, I don't know, but it
> >>>>> seems pretty promising and it's soooo much simpler.
> >>>>
> >>>> If buffer register and lookup are always done in ->prep(), the OP
> >>>> dependency may be avoided.
> >>>
> >>> Even if all buffer register and lookup are done in ->prep(), the OP
> >>> dependency still can't be avoided completely, such as:
> >>>
> >>> 1) two local buffers for sending to two sockets
> >>>
> >>> 2) group 1: IORING_OP_LOCAL_KBUF1 & [send(sock1), send(sock2)]
> >>>
> >>> 3) group 2: IORING_OP_LOCAL_KBUF2 & [send(sock1), send(sock2)]
> >>>
> >>> group 1 and group 2 need to be linked, but inside each group, the two
> >>> sends may be submitted in parallel.
> >>
> >> That is where groups of course work, in that you can submit 2 groups and
> >> have each member inside each group run independently. But I do think we
> >> need to decouple the local buffer and group concepts entirely. For the
> >> first step, getting local buffers working with zero copy would be ideal,
> >> and then just live with the fact that group 1 needs to be submitted
> >> first and group 2 once the first ones are done.
> >
> > IMHO, it is a _kernel_ zero copy (_performance_) feature, which often
> > implies:
> >
> > - better performance expectation
> > - no big change to existing applications for using this feature
>
> For #2, really depends on what it is. But ideally, yes, agree.
>
> > Application developers are less interested in a crippled or
> > immature feature, especially one that needs big changes to existing code
> > logic (so two code paths need to be maintained), with potential
> > performance regression.
> >
> > With sqe group and REQ_F_GROUP_KBUF, an application just needs a few
> > lines of code change to use the feature, and it is pretty easy to
> > evaluate the feature since no extra logic change and no extra
> > syscall/wait are introduced. The whole patchset has been mature enough,
> > but is unfortunately blocked without obvious reasons.
>
> Let me tell you where I'm coming from. If you might recall, I originated
> the whole grouping idea. Didn't complete it, but it's essentially the
> same concept as REQ_F_GROUP_KBUF in that you have some dependents on a
> leader, and the dependents can run in parallel rather than being
> serialized by links. I'm obviously in favor of this concept, but I want
> to see it being done in such a way that it's actually something we can
> reason about and maintain. You want it for zero copy, which makes sense,
> but I also want to ensure it's a CLEAN implementation that doesn't have
> tangles in places it doesn't need to.
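
To make the two-group example quoted above concrete, here is a minimal
sketch of the submission pattern in liburing style. IOSQE_GROUP_BIT and
io_uring_prep_local_kbuf() are hypothetical placeholders for the
grouping/buffer-providing uAPI being discussed (not an existing
interface), and setting IOSQE_IO_LINK on the leader to order group 1
before group 2 follows the patchset's intended semantics:

	#include <liburing.h>
	#include <stdbool.h>

	/* Sketch only: one group = a leader that leases a kernel buffer
	 * plus two sends that consume it and may run in parallel. */
	static void submit_group(struct io_uring *ring, unsigned buf_idx,
				 int sock1, int sock2, unsigned len,
				 bool link_next_group)
	{
		struct io_uring_sqe *sqe;

		/* group leader: provide the kernel buffer for this group */
		sqe = io_uring_get_sqe(ring);
		io_uring_prep_local_kbuf(sqe, buf_idx);	/* hypothetical helper */
		sqe->flags |= IOSQE_GROUP_BIT;		/* hypothetical flag */
		if (link_next_group)
			sqe->flags |= IOSQE_IO_LINK;	/* group 1 -> group 2 ordering */

		/* members: part of the group, not serialized against each
		 * other; addr is NULL because the data comes from the
		 * leased kernel buffer selected via buf_index (assumed) */
		sqe = io_uring_get_sqe(ring);
		io_uring_prep_send(sqe, sock1, NULL, len, 0);
		sqe->buf_index = buf_idx;

		sqe = io_uring_get_sqe(ring);
		io_uring_prep_send(sqe, sock2, NULL, len, 0);
		sqe->buf_index = buf_idx;
	}
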
>
> You seem to be very hard to convince of making ANY changes at all. In
> your mind the whole thing is done, and it's being "blocked without
> obvious reason". It's not being blocked at all, I've been diligently
> trying to work with you recently on getting this done. I'm at least as
> interested as you in getting this work done. But I want you to work with
> me a bit on some items so we can get it into a shape where I'm happy
> with it, and I can maintain it going forward.
>
> So, please, rather than dig your heels in all the time, have an open
> mind on how we can accomplish some of these things.
>
> >> Once local buffers are done, we can look at doing the sqe grouping in a
> >> nice way. I do think it's a potentially powerful concept, but we're
> >> going to make a lot more progress on this issue if we carefully separate
> >> dependencies and get each of them done separately.
> >
> > One fundamental difference between local buffer and REQ_F_GROUP_KBUF is
> >
> > - local buffer has to be provided and used in ->prep()
> > - REQ_F_GROUP_KBUF needs to be provided in ->issue() instead of ->prep()
>
> It does not - the POC certainly did it in ->prep(), but all it really
> cares about is having the ring locked. ->prep() always has that,
> ->issue() _normally_ has that, unless it ends up in an io-wq punt.
>
> You can certainly do it in ->issue() and still have it be per-submit,
> the latter of which I care about for safety reasons. This just means it
> has to be provided in the _first_ issue, and that IOSQE_ASYNC must not
> be set on the request. I think that restriction is fine, nobody should
> really be using IOSQE_ASYNC anyway.

Yes, IOSQE_ASYNC can't work, and IO_LINK can't work either. The
restriction on IO_LINK is too strong, because lifting it would need big
application changes.

>
> I think the original POC maybe did more harm than good in that it was
> too simplistic, and you seem too focused on the limits of that. So let

I am actually trying to think through how to code local buffer, which is
why I put these questions first.

> me detail what it actually could look like. We have io_submit_state in
> io_ring_ctx. This is per-submit private data, it's initialized and
> flushed for each io_uring_enter(2) that submits requests.
>
> We have a registered file and buffer table, file_table and buf_table.
> These have lifetimes that are dependent on the ring and
> registration/unregistration. We could have a local_table. This one
> should be set up by some register command, e.g. reserving X slots for
> that. At the end of submit, we'd flush this table, putting nodes in
> there. Requests can look at the table in either prep or issue, and find
> buffer nodes. If a request uses one of these, it grabs a ref and hence
> has it available until it puts it at IO completion time. When a single
> submit context is done, local_table is iterated (if any entries exist)
> and existing nodes cleared and put.
>
> That provides a similar table to buf_table, but with a lifetime of a
> submit. Just like local buf. Yes, it would not be private to a single
> group, it'd be private to a submit which has potentially bigger scope,
> but that should not matter.
>
> That should give you exactly what you need, if you use
> IORING_RSRC_KBUFFER in the local_table. But it could even be used for
> IORING_RSRC_BUFFER as well, providing buffers for a single submit cycle
> as well.
>
> Rather than do something completely on the side with
> io_uring_kernel_buf, we can use io_rsrc_node and io_mapped_ubuf for
> this.
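
If I read the local_table idea right, the end-of-submit flush could be
as simple as the sketch below; local_table as an io_rsrc_data hanging
off io_ring_ctx is my assumption, and io_put_rsrc_node() is the existing
node put from your table rework:

	/* Sketch: drop the table's reference on every node installed
	 * during this submit cycle. An op that grabbed its own ref in
	 * prep/issue keeps the buffer alive until its completion-time
	 * put, so this only ends the per-submit lifetime. */
	static void io_local_table_flush(struct io_ring_ctx *ctx)
	{
		struct io_rsrc_data *tbl = &ctx->local_table;	/* assumed member */
		int i;

		for (i = 0; i < tbl->nr; i++) {
			struct io_rsrc_node *node = tbl->nodes[i];

			if (!node)
				continue;
			tbl->nodes[i] = NULL;
			io_put_rsrc_node(ctx, node);
		}
	}

This would be called from wherever the submit state is flushed, guarded
by the cheap `if (ctx->local_table.nr)` check you mention below.
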
> Which goes back to my initial rant in this email - use EXISTING
> infrastructure for these things. A big part of why this isn't making
> progress is that a lot of things are done on the side rather than being
> integrated. Then you need extra io_kiocb members, where it really should
> just be using io_rsrc_node and get everything else for free. No need to
> do special checking and putting separately, it's a resource node just
> like any other resource node we already support.
>
> > The only common code could be one buffer abstraction for OPs to use,
> > but it would still be used differently, ->prep() vs. ->issue().
>
> With the prep vs issue thing not being an issue, then it sounds like we

IO_LINK is another blocker for the prep vs issue thing.

> fully agree that a) it should be one buffer abstraction, and b) we

Yes, can't agree more.

> already have the infrastructure for this. We just need to add
> IORING_RSRC_KBUFFER, which I already posted some POC code for.

I am open to IORING_RSRC_KBUFFER if there isn't too strong a limit for
applications. Disallowing IOSQE_ASYNC is fine, but not allowing IO_LINK
does cause trouble for applications, and needs big changes to app code.

> > So it is hard to call it decoupling, especially since REQ_F_GROUP_KBUF
> > has been simple enough, and the main change is to import it in OP code.
> >
> > Local buffer is one smart idea, but I hope the following things can be
> > settled first:
> >
> > 1) is it generic enough to just allow providing local buffers during
> > ->prep()?
> >
> > - this way actually becomes sync & nowait IO, instead of AIO, and that
> > has been one strong constraint from the UAPI viewpoint.
> >
> > - a driver may need to wait until some data comes, then return &
> > provide the buffer with data, and local buffer can't cover this case
>
> This should be moot now with the above explanation.
>
> > 2) is it allowed to call ->uring_cmd() from io_uring_cmd_prep()? If
> > not, any idea how to call into the driver for leasing the kernel
> > buffer to io_uring?
>
> Ditto
>
> > 3) in OP code, how to differentiate normal userspace buffer select
> > from local buffer? And how does an OP know if normal buffer select or
> > local kernel buffer should be used? Some OPs may want to use normal
> > buffer select instead of local buffer, others may want to use local
> > buffer.
>
> Yes this is a key question we need to figure out. Right now using fixed
> buffers needs to set ->buf_index, and the OP needs to know about it.
> Let's not confuse it with buffer select, IOSQE_BUFFER_SELECT, as that's
> for provided buffers.

Indeed, what matters here is fixed buffers.

> > 4) arbitrary numbers of local buffers need to be supported, since IO
> > often comes in batches. It shouldn't be hard to support by adding an
> > xarray to the submission state, but what do you think of that added
> > complexity? Without supporting an arbitrary number of local buffers,
> > performance can just be bad; it doesn't make sense from a zc
> > viewpoint. Meantime, as the number of local buffers increases, more
> > rsrc_node & imu allocations are introduced, and this may still degrade
> > perf a bit.
>
> That's fine, we just need to reserve space for them upfront. I don't
> like the xarray idea, as:
>
> 1) xarray does internal locking, which we don't need here
> 2) The existing io_rsrc_data table is what is being used for
> io_rsrc_node management now. This would introduce another method for
> that.
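
On 3) above: if the per-submit table reuses io_rsrc_data, then resolving
->buf_index could just try the local table before the registered buffer
table. A sketch (io_find_buf_node() is made up, io_rsrc_node_lookup() is
from your table rework, local_table is the assumed per-submit table):

	/* Sketch: fixed-buffer lookup that prefers a per-submit local
	 * buffer over a long-lived registered one, grabbing a reference
	 * that the op drops at completion. */
	static struct io_rsrc_node *io_find_buf_node(struct io_kiocb *req,
						     unsigned buf_index)
	{
		struct io_ring_ctx *ctx = req->ctx;
		struct io_rsrc_node *node;

		node = io_rsrc_node_lookup(&ctx->local_table, buf_index);
		if (!node)
			node = io_rsrc_node_lookup(&ctx->buf_table, buf_index);
		if (node)
			node->refs++;	/* put at IO completion time */
		return node;
	}

This would keep OPs oblivious to whether the buffer is a kernel-leased
one or a normal registered buffer; only the node type differs.
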
>
> I do want to ensure that io_submit_state_finish() is still low overhead,
> and using an xarray would be more expensive than just doing:
>
> 	if (ctx->local_table.nr)
> 		flush_nodes();
>
> as you'd need to always set up an iterator. But this isn't really THAT
> important. The benefit of using an xarray would be that we'd get
> flexible storing of members without needing pre-registration, obviously.

The main trouble could be buffer table pre-allocation if an xarray isn't
used.

> > 5) io_rsrc_node becomes part of the interface between io_uring and the
> > driver for releasing the leased buffer, so extra data has to be
> > added to `io_rsrc_node` for driver use.
>
> That's fine imho. The KBUFFER addition already adds the callback, we can
> add a data thing too. The kernel you based your code on has an
> io_rsrc_node that is 48 bytes in size, and my current tree has one where
> it's 32 bytes in size after the table rework. If we have to add 2x8b to
> support this, that's NOT a big deal and we just end up with a node
> that's the same size as before.
>
> And we get rid of this odd intermediate io_uring_kernel_buf struct,
> which is WAY bigger anyway, and requires TWO allocations where the
> existing io_mapped_ubuf embeds the bvec. I'd argue two vs one allocs is
> a much bigger deal for performance reasons.

The structure is actually preallocated in ublk, so it is zero allocation
in my patchset in the case where bios aren't merged.

>
> As a final note, one thing I mentioned in an earlier email is that part
> of the issue here is that there are several things that need ironing
> out, and they are actually totally separate. One is the buffer side,
> which this email mostly deals with, the other one is the grouping
> concept.
>
> For the sqe grouping, one sticking point has been using that last
> sqe->flags bit. I was thinking about this last night, and what if we got
> away from using a flag entirely? At some point io_uring needs to deal
> with this flag limitation, but it's arguably a large effort, and I'd
> greatly prefer not having to paper over it to shove in grouped SQEs.
>
> So... what if we simply added a new OP, IORING_OP_GROUP_START, or
> something like that. Hence instead of having a designated group leader
> bit for an OP, e.g.:
>
> 	sqe = io_uring_get_sqe(ring);
> 	io_uring_prep_read(sqe, ...);
> 	sqe->flags |= IOSQE_GROUP_BIT;
>
> you'd do:
>
> 	sqe = io_uring_get_sqe(ring);
> 	io_uring_prep_group_start(sqe, ...);
> 	sqe->flags |= IOSQE_IO_LINK;

One problem is how to map IORING_OP_GROUP_START to uring_cmd.

IOSQE_IO_LINK is another trouble spot, because it then becomes
impossible to model dependencies among groups.

>
> 	sqe = io_uring_get_sqe(ring);
> 	io_uring_prep_read(sqe, ...);
>
> which would be the equivalent transformation - the read would be the
> group leader as it's the first member of that chain. The read should set
> IOSQE_IO_LINK for as long as it has members. The members in that group
> would NOT be serialized. They would use IOSQE_IO_LINK purely to be part
> of that group, but IOSQE_IO_LINK would not cause them to be serialized.
> Hence the link just implies membership, not ordering within the group.
>
> This removes the flag issue, with the slight caveat that IOSQE_IO_LINK
> has a different meaning inside the group. Maybe you'd need a GROUP_END

No, for the group leader IOSQE_IO_LINK works the same as for any other
normal SQE.

Thanks,
Ming