On Fri, Oct 11, 2024 at 08:41:03AM -0600, Jens Axboe wrote:
> On 10/11/24 8:20 AM, Ming Lei wrote:
> > On Fri, Oct 11, 2024 at 07:24:27AM -0600, Jens Axboe wrote:
> >> On 10/10/24 9:07 PM, Ming Lei wrote:
> >>> On Thu, Oct 10, 2024 at 08:39:12PM -0600, Jens Axboe wrote:
> >>>> On 10/10/24 8:30 PM, Ming Lei wrote:
> >>>>> Hi Jens,
> >>>>>
> >>>>> On Thu, Oct 10, 2024 at 01:31:21PM -0600, Jens Axboe wrote:
> >>>>>> Hi,
> >>>>>>
> >>>>>> Discussed this with Pavel, and on his suggestion, I tried prototyping a "buffer update" opcode. Basically it works like IORING_REGISTER_BUFFERS_UPDATE in that it can update an existing buffer registration. But it works as an sqe rather than being a sync opcode.
> >>>>>>
> >>>>>> The idea here is that you could do that upfront, or as part of a chain, and have it be generically available, just like any other buffer that was registered upfront. You do need an empty table registered first, which can just be sparse. And since you can pick the slot it goes into, you can rely on that slot afterwards (either as a link, or just the following sqe).
> >>>>>>
> >>>>>> Quick'n dirty obviously, but I did write a quick test case too to verify that:
> >>>>>>
> >>>>>> 1) It actually works (it seems to)
> >>>>>
> >>>>> It doesn't work for ublk zc since ublk needs to provide one kernel buffer for fs rw & net send/recv to consume, and the kernel buffer is invisible to userspace. But __io_register_rsrc_update() only can register userspace buffer.
> >>>>
> >>>> I'd be surprised if this simple one was enough! In terms of user vs kernel buffer, you could certainly use the same mechanism, and just ensure that buffers are tagged appropriately. I need to think about that a little bit.
> >>>
> >>> It is actually same with IORING_OP_PROVIDE_BUFFERS, so the following consumer OPs have to wait until this OP_BUF_UPDATE is completed.
> >>
> >> See below for the registered vs provided buffer confusion that seems to be a confusion issue here.
> >>
> >>> Suppose we have N consumers OPs which depends on OP_BUF_UPDATE.
> >>>
> >>> 1) all N OPs are linked with OP_BUF_UPDATE
> >>>
> >>> Or
> >>>
> >>> 2) submit OP_BUF_UPDATE first, and wait its completion, then submit N OPs concurrently.
> >>
> >> Correct
> >>
> >>> But 1) and 2) may slow the IO handing. In 1) all N OPs are serialized, and 1 extra syscall is introduced in 2).
> >>
> >> Yes you don't want do do #1. But the OP_BUF_UPDATE is cheap enough that you can just do it upfront. It's not ideal in terms of usage, and I get where the grouping comes from. But is it possible to do the grouping in a less intrusive fashion with OP_BUF_UPDATE? Because it won't change any
> >
> > The most of 'intrusive' change is just on patch 4, and Pavel has commented that it is good enough:
> >
> > https://lore.kernel.org/linux-block/ZwZzsPcXyazyeZnu@fedora/T/#m551e94f080b80ccbd2561e01da5ea8e17f7ee15d
>
> At least for me, patch 4 looks fine. The problem occurs when you start needing to support this different buffer type, which is in patch 6. I'm not saying we can necessarily solve this with OP_BUF_UPDATE, I just want to explore that path because if we can, then patch 6 turns into "oh let's just added registered/fixed buffer support to these ops that don't currently support it". And that would be much nicer indeed.
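
To make sure I understand the flow you are suggesting: in today's liburing terms it would be roughly the sketch below, with the sync update call standing in for the OP_BUF_UPDATE sqe from your prototype (untested, just to illustrate):

        #include <errno.h>
        #include <stdlib.h>
        #include <liburing.h>

        /*
         * Sketch: 1) register an empty (sparse) table once, 2) point slot 0
         * at a buffer, 3) let any fixed-buffer op consume slot 0 via
         * sqe->buf_index.
         */
        static int buf_update_flow(struct io_uring *ring, int fd)
        {
                struct io_uring_sqe *sqe;
                struct io_uring_cqe *cqe;
                struct iovec iov;
                __u64 tag = 0;
                void *buf;
                int ret;

                /* 1) one-time setup: sparse table with 16 empty slots */
                ret = io_uring_register_buffers_sparse(ring, 16);
                if (ret)
                        return ret;

                /* 2) fill slot 0; sync today, an OP_BUF_UPDATE sqe with the prototype */
                if (posix_memalign(&buf, 4096, 4096))
                        return -ENOMEM;
                iov.iov_base = buf;
                iov.iov_len = 4096;
                ret = io_uring_register_buffers_update_tag(ring, 0, &iov, &tag, 1);
                if (ret < 0)
                        return ret;

                /* 3) any op with fixed-buffer support can now reference slot 0 */
                sqe = io_uring_get_sqe(ring);
                io_uring_prep_read_fixed(sqe, fd, buf, 4096, 0, 0);

                ret = io_uring_submit(ring);
                if (ret < 0)
                        return ret;
                ret = io_uring_wait_cqe(ring, &cqe);
                if (ret)
                        return ret;
                ret = cqe->res;
                io_uring_cqe_seen(ring, cqe);
                return ret;
        }

The prototype would let step 2) be queued as an sqe instead, either upfront or linked ahead of the consumer ops.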
OK, in my local V7 the buffer type is actually aligned with BUFFER_SELECT from both the interface and usage viewpoints, since the SQE still has three spare flags available. I will post V7 for review.

> >
> >> of the other ops in terms of buffer consumption, they'd just need fixed buffer support and you'd flag the buffer index in sqe->buf_index. And the nice thing about that is that while fixed/registered buffers aren't really used on the networking side yet (as they don't bring any benefit yet), adding support for them could potentially be useful down the line anyway.

> > With 2), two extra syscalls are added for each ublk IO, one is provide buffer, another is remove buffer. The two syscalls have to be sync with consumer OPs.

> > I can understand the concern, but if the change can't improve perf or even slow things done, it loses its value.

> It'd be one extra syscall, as the remove can get bundled with the next add. But your point still stands, yes it will add extra overhead,

It can't be bundled. The kernel buffer consists of blk-mq request pages, which are per tag. For example, for a ublk target, an IO arrives on tag 0; after this IO (tag 0) is handled, how can we know whether a new IO will arrive on tag 0 immediately? :-)

> although be it pretty darn minimal. I'm actually more concerned with the complexity for handling it. While the OP_BUF_UPDATE will always complete immediately, there's no guarantee it's the next cqe you pull out when peeking post submission.

> >>> The same thing exists in the next OP_BUF_UPDATE which has to wait until all the previous buffer consumers are done. So the same slow thing are doubled. Not mention the application will become more complicated.

> >> It does not, you can do an update on a buffer that's already inflight.

> > UPDATE may not match the case, actually two OPs are needed, one is provide buffer OP and the other is remove buffer OP, both have to deal with the other subsystem(ublk). Remove buffer needs to be done after all consumer OPs are done immediately.

> You don't necessarily need the remove. If you always just use the same slot for these, then the OP_BUF_UPDATE will just update the current location.

The buffer is per tag, and it can't be guaranteed to be reused immediately; otherwise it isn't zero copy any more.

> > I guess you mean the buffer is reference-counted, but what if the remove buffer OP is run before any consumer OP? The order has to be enhanced.

> > That is why I mention two syscalls are added.

> See above, you can just update in place, and if you do want remove, it can get bundled with the next one. But it would be pointless to remove only then to update right after, a single update would suffice.

> >>> Here the provided buffer is only visible among the N OPs wide, and making it global isn't necessary, and slow things down. And has kbuf lifetime issue.

> >> I was worried about it being too slow too, but the basic testing seems like it's fine. Yes with updates inflight it'll make it a tad bit slower, but really should not be a concern. I'd argue that even doing the very basic of things, which would be:
> >>
> >> 1) Submit OP_BUF_UPDATE, get completion
> >> 2) Do the rest of the ops

> > The above adds one syscall for each ublk IO, and the following Remove buffer adds another syscall.

> > Not only it slows thing down, but also makes application more complicated, cause two wait points are added.

> I don't think the extra overhead would be noticeable though, but the extra complication is the main issue here.

Can't agree more.
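
Just to spell out the two wait points for one ublk IO under that scheme; this is only pseudo-code, and ublk_provide_buf()/ublk_remove_buf()/queue_consumer_ops() are made-up names, not any existing API:

        /* Hypothetical flow for one ublk IO on 'tag'; names marked (*) do not
         * exist today, they only stand in for whatever installs/removes the
         * request pages in the registered-buffer table.
         */
        ublk_provide_buf(ring, tag, slot);        /* (*) install request pages  */
        io_uring_submit_and_wait(ring, 1);        /* wait point 1 (+1 syscall)  */

        queue_consumer_ops(ring, slot, nr_ops);   /* (*) fs rw / net send using
                                                   *     the fixed buffer slot  */
        io_uring_submit_and_wait(ring, nr_ops);

        ublk_remove_buf(ring, tag, slot);         /* (*) return pages to blk-mq */
        io_uring_submit_and_wait(ring, 1);        /* wait point 2 (+1 syscall)  */

With the sqe group approach the buffer is borrowed by the group and returned when the group leader completes, so both wait points and both extra submits go away.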
> >> would be totally fine in terms of performance. OP_BUF_UPDATE will _always_ completely immediately and inline, which means that it'll _always_ be immediately available post submission. The only think you'd ever have to worry about in terms of failure is a badly formed request, which is a programming issue, or running out of memory on the host.

> >>> Also it makes error handling more complicated, io_uring has to remove the kernel buffer when the current task is exit, dependency or order with buffer provider is introduced.

> >> Why would that be? They belong to the ring, so should be torn down as part of the ring anyway? Why would they be task-private, but not ring-private?

> > It is kernel buffer, which belongs to provider(such as ublk) instead of uring, application may panic any time, then io_uring has to remove the buffer for notifying the buffer owner.

> But it could be an application buffer, no? You'd just need the application to provide it to ublk and have it mapped, rather than have ublk allocate it in-kernel and then use that.

The buffer is actually the kernel 'request/bio' pages of /dev/ublkbN, and we forward/borrow them to io_uring OPs (fs rw, net send/recv), so it can't be an application buffer; this isn't the same as net rx.

> > In concept grouping is simpler because:
> >
> > - buffer lifetime is aligned with group leader lifetime, so we needn't worry buffer leak because of application accidental exit

> But if it was an application buffer, that would not be a concern.

Yeah, but storage isn't the same as networking: here an application buffer can't support zero copy.

> > - the buffer is borrowed to consumer OPs, and returned back after all consumers are done, this way avoids any dependency

> > Meantime OP_BUF_UPDATE(provide buffer OP, remove buffer OP) becomes more complicated:
> >
> > - buffer leak because of app panic
> > - buffer dependency issue: consumer OPs depend on provide buffer OP, remove buffer OP depends on consumer OPs; two syscalls has to be added for handling single ublk IO.

> Seems like most of this is because of the kernel buffer too, no?

Yeah.

> I do like the concept of the ephemeral buffer, the downside is that we need per-op support for it too. And while I'm not totally against doing

Can you explain the per-op support a bit? Right now the buffer is provided by one single uring command.

> that, it would be lovely if we could utilize and existing mechanism for that rather than add another one.

If the existing mechanism could cover everything, Linux would not progress any more.

> >>>> There are certainly many different ways that can get propagated which would not entail a complicated mechanism. I really like the aspect of having the identifier being the same thing that we already use, and hence not needing to be something new on the side.

> >>>>> Also multiple OPs may consume the buffer concurrently, which can't be supported by buffer select.

> >>>> Why not? You can certainly have multiple ops using the same registered buffer concurrently right now.

> >>> Please see the above problem.

> >>> Also I remember that the selected buffer is removed from buffer list, see io_provided_buffer_select(), but maybe I am wrong.

> >> You're mixing up provided and registered buffers. Provided buffers are ones that the applications gives to the kernel, and the kernel grabs and consumes them. Then the application replenishes, repeat.

> >> Registered buffers are entirely different, those are registered with the kernel and we can do things like pre-gup the pages so we don't have to do them for every IO. They are entirely persistent, any multiple ops can keep using them, concurrently. They don't get consumed by an IO like provided buffers, they remain in place until they get unregistered (or updated, like my patch) at some point.

> > I know the difference.
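
To keep the two models apart for anyone following the thread, the userspace view is roughly the sketch below (the constants and fds are made up, the calls are plain liburing):

        #include <liburing.h>

        #define BUF_SZ   4096
        #define NR_BUFS  8
        #define BGID     7

        /* sketch: provided vs registered buffers side by side */
        static void provided_vs_registered(struct io_uring *ring, int fd,
                                           char pool[NR_BUFS][BUF_SZ], void *fixed)
        {
                struct io_uring_sqe *sqe;
                struct iovec iov = { .iov_base = fixed, .iov_len = BUF_SZ };

                /* provided: hand a pool to the kernel; each op that sets
                 * IOSQE_BUFFER_SELECT consumes one buffer from the group,
                 * and the application has to replenish afterwards */
                sqe = io_uring_get_sqe(ring);
                io_uring_prep_provide_buffers(sqe, pool, BUF_SZ, NR_BUFS, BGID, 0);

                sqe = io_uring_get_sqe(ring);
                io_uring_prep_recv(sqe, fd, NULL, BUF_SZ, 0);
                sqe->flags |= IOSQE_BUFFER_SELECT;
                sqe->buf_group = BGID;

                /* registered: pinned once, never consumed; any number of ops
                 * can keep referencing slot 0 until it is unregistered or
                 * updated */
                io_uring_register_buffers(ring, &iov, 1);

                sqe = io_uring_get_sqe(ring);
                io_uring_prep_read_fixed(sqe, fd, fixed, BUF_SZ, 0, 0);

                io_uring_submit(ring);
        }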
> But io_provided_buffer_select() has nothing to do with registered/fixed buffers or this use case, the above "remove from buffer list" is an entirely different buffer concept. So there's some confusion here, just wanted to make that clear.

> > The thing is that here we can't register the kernel buffer in ->prep(), and it has to be provided in ->issue() of uring command. That is similar with provided buffer.

> What's preventing it from registering it in ->prep()? It would be a bit odd, but there would be nothing preventing it codewise, outside of the oddity of ->prep() not being idempotent at that point. Don't follow why that would be necessary, though, can you expand?

->prep() isn't exported to uring cmd, and we may not want to bother drivers with it. Also, removing the buffer still can't be done in ->prep(). Without digging in further, one big thing could be that dependencies aren't respected in ->prep().


thanks,
Ming