On Fri, Oct 11, 2024 at 10:00:06AM +0800, Ming Lei wrote: > On Thu, Oct 10, 2024 at 07:51:19PM +0100, Pavel Begunkov wrote: > > On 10/10/24 04:00, Ming Lei wrote: > > > On Wed, Oct 09, 2024 at 03:25:25PM +0100, Pavel Begunkov wrote: > > > > On 10/6/24 09:20, Ming Lei wrote: > > > > > On Fri, Oct 04, 2024 at 04:32:04PM +0100, Pavel Begunkov wrote: > > > > > > On 9/12/24 11:49, Ming Lei wrote: > > > > > > ... > > > > > > > It can help to implement generic zero copy between device and related > > > > > > > operations, such as ublk, fuse, vdpa, > > > > > > > even network receive or whatever. > > > > > > > > > > > > That's exactly the thing it can't sanely work with because > > > > > > of this design. > > > > > > > > > > The provide buffer design is absolutely generic, and basically > > > > > > > > > > - group leader provides buffer for member OPs, and member OPs can borrow > > > > > the buffer if leader allows by calling io_provide_group_kbuf() > > > > > > > > > > - after member OPs consumes the buffer, the buffer is returned back by > > > > > the callback implemented in group leader subsystem, so group leader can > > > > > release related sources; > > > > > > > > > > - and it is guaranteed that the buffer can be released always > > > > > > > > > > The ublk implementation is pretty simple, it can be reused in device driver > > > > > to share buffer with other kernel subsystems. > > > > > > > > > > I don't see anything insane with the design. > > > > > > > > There is nothing insane with the design, but the problem is cross > > > > request error handling, same thing that makes links a pain to use. > > > > > > Wrt. link, the whole group is linked in the chain, and it respects > > > all existed link rule, care to share the trouble in link use case? > > > > Error handling is a pain, it has been, even for pure link without > > any groups. Even with a simple req1 -> req2, you need to track if > > the first request fails you need to expect another failed CQE for > > the second request, probably refcount (let's say non-atomically) > > some structure and clean it up when you get both CQEs. It's not > > prettier when the 2nd fails, especially if you consider short IO > > and that you can't fully retry that partial IO, e.g. you consumed > > data from the socket. And so on. > > > > > The only thing I thought of is that group internal link isn't supported > > > yet, but it may be added in future if use case is coming. > > > > > > > It's good that with storage reads are reasonably idempotent and you > > > > can be retried if needed. With sockets and streams, however, you > > > > can't sanely borrow a buffer without consuming it, so if a member > > > > request processing the buffer fails for any reason, the user data > > > > will be dropped on the floor. I mentioned quite a while before, > > > > if for example you stash the buffer somewhere you can access > > > > across syscalls like the io_uring's registered buffer table, the > > > > user at least would be able to find an error and then memcpy the > > > > unprocessed data as a fallback. > > > > > > I guess it is net rx case, which requires buffer to cross syscalls, > > > then the buffer has to be owned by userspace, otherwise the buffer > > > can be leaked easily. > > > > > > That may not match with sqe group which is supposed to borrow kernel > > > buffer consumed by users. > > > > It doesn't necessarily require to keep buffers across syscalls > > per se, it just can't drop the data it got on the floor. It's > > just storage can read data again. > > In case of short read, data is really stored(not dropped) in the provided > buffer, and you can consume the short read data, or continue to read more to > the same buffer. > > What is the your real issue here? > > > > > ... > > > > > > > diff --git a/include/linux/io_uring_types.h b/include/linux/io_uring_types.h > > > > > > > index 793d5a26d9b8..445e5507565a 100644 > > > > > > > --- a/include/linux/io_uring_types.h > > > > > > > +++ b/include/linux/io_uring_types.h > > ... > > > > > > FWIW, would be nice if during init figure we can verify that the leader > > > > > > provides a buffer IFF there is someone consuming it, but I don't think > > > > > > > > > > It isn't doable, same reason with IORING_OP_PROVIDE_BUFFERS, since buffer can > > > > > only be provided in ->issue(). > > > > > > > > In theory we could, in practise it'd be too much of a pain, I agree. > > > > > > > > IORING_OP_PROVIDE_BUFFERS is different as you just stash the buffer > > > > in the io_uring instance, and it's used at an unspecified time later > > > > by some request. In this sense the API is explicit, requests that don't > > > > support it but marked with IOSQE_BUFFER_SELECT will be failed by the > > > > kernel. > > > > > > That is also one reason why I add ->accept_group_kbuf. > > > > I probably missed that, but I haven't seen that > > Such as, any OPs with fixed buffer can't set ->accept_group_kbuf. > > > > > > > > > the semantics is flexible enough to do it sanely. i.e. there are many > > > > > > members in a group, some might want to use the buffer and some might not. > > > > > > > > ... > > > > > > > + if (!kbuf->bvec) > > > > > > > + return -EINVAL; > > > > > > > > > > > > How can this happen? > > > > > > > > > > OK, we can run the check in uring_cmd API. > > > > > > > > Not sure I follow, if a request providing a buffer can't set > > > > a bvec it should just fail, without exposing half made > > > > io_uring_kernel_buf to other requests. > > > > > > > > Is it rather a WARN_ON_ONCE check? > > > > > > I meant we can check it in API of io_provide_group_kbuf() since the group > > > buffer is filled by driver, since then the buffer is immutable, and we > > > needn't any other check. > > > > That's be a buggy provider, so sounds like WARN_ON_ONCE > > Not at all. > > If the driver provides bad buffer, all group leader and members OP will be > failed, and userspace can get notified. > > > > > ... > > > > > > > if (unlikely(ret < 0)) > > > > > > > @@ -593,6 +600,15 @@ int io_send(struct io_kiocb *req, unsigned int issue_flags) > > > > > > > if (issue_flags & IO_URING_F_NONBLOCK) > > > > > > > flags |= MSG_DONTWAIT; > > > > > > > + if (req->flags & REQ_F_GROUP_KBUF) { > > > > > > > > > > > > Does anything prevent the request to be marked by both > > > > > > GROUP_KBUF and BUFFER_SELECT? In which case we'd set up > > > > > > a group kbuf and then go to the io_do_buffer_select() > > > > > > overriding all of that > > > > > > > > > > It could be used in this way, and we can fail the member in > > > > > io_queue_group_members(). > > > > > > > > That's where the opdef flag could actually be useful, > > > > > > > > if (opdef[member]->accept_group_kbuf && > > > > member->flags & SELECT_BUF) > > > > fail; > > > > > > > > > > > > > > > + ret = io_import_group_kbuf(req, > > > > > > > + user_ptr_to_u64(sr->buf), > > > > > > > + sr->len, ITER_SOURCE, > > > > > > > + &kmsg->msg.msg_iter); > > > > > > > + if (unlikely(ret)) > > > > > > > + return ret; > > > > > > > + } > > > > > > > + > > > > > > > retry_bundle: > > > > > > > if (io_do_buffer_select(req)) { > > > > > > > struct buf_sel_arg arg = { > > > > > > > @@ -1154,6 +1170,11 @@ int io_recv(struct io_kiocb *req, unsigned int issue_flags) > > > > > > > goto out_free; > > > > > > > } > > > > > > > sr->buf = NULL; > > > > > > > + } else if (req->flags & REQ_F_GROUP_KBUF) { > > > > > > > > > > > > What happens if we get a short read/recv? > > > > > > > > > > For short read/recv, any progress is stored in iterator, nothing to do > > > > > with the provide buffer, which is immutable. > > > > > > > > > > One problem for read is reissue, but it can be handled by saving iter > > > > > state after the group buffer is imported, I will fix it in next version. > > > > > For net recv, offset/len of buffer is updated in case of short recv, so > > > > > it works as expected. > > > > > > > > That was one of my worries. > > > > > > > > > Or any other issue with short read/recv? Can you explain in detail? > > > > > > > > To sum up design wise, when members that are using the buffer as a > > > > source, e.g. write/send, fail, the user is expected to usually reissue > > > > both the write and the ublk cmd. > > > > > > > > Let's say you have a ublk leader command providing a 4K buffer, and > > > > you group it with a 4K send using the buffer. Let's assume the send > > > > is short and does't only 2K of data. Then the user would normally > > > > reissue: > > > > > > > > ublk(4K, GROUP), send(off=2K) > > > > > > > > That's fine assuming short IO is rare. > > > > > > > > I worry more about the backward flow, ublk provides an "empty" buffer > > > > to receive/read into. ublk wants to do something with the buffer in > > > > the callback. What happens when read/recv is short (and cannot be > > > > retried by io_uring)? > > > > > > > > 1. ublk(provide empty 4K buffer) > > > > 2. recv, ret=2K > > > > 3. ->grp_kbuf_ack: ublk should commit back only 2K > > > > of data and not assume it's 4K > > > > > > ->grp_kbuf_ack is supposed to only return back the buffer to the > > > owner, and it doesn't care result of buffer consumption. > > > > > > When ->grp_kbuf_ack() is done, it means this time buffer borrow is > > > over. > > > > > > When userspace figures out it is one short read, it will send one > > > ublk uring_cmd to notify that this io command is completed with > > > result(2k). ublk driver may decide to requeue this io command for > > > retrying the remained bytes, when only remained part of the buffer > > > is allowed to borrow in following provide uring command originated > > > from userspace. > > > > My apologies, I failed to notice that moment, even though should've > > given it some thinking at the very beginning. I think that part would > > be a terrible interface. Might be good enough for ublk, but we can't > > be creating a ublk specific features that change the entire io_uring. > > Without knowing how much data it actually got, in generic case you > > You do know how much data actually got in the member OP, don't you? > > > 1) need to require the buffer to be fully initialised / zeroed > > before handing it. > > The buffer is really initialized before being provided via > io_provide_group_kbuf(). And it is one bvec buffer, anytime the part > is consumed, the iterator is advanced, so always initialized buffer > is provided to consumer OP. > > > 2) Can't ever commit the data from the callback, > > What do you mean `commit`? > > The callback is documented clearly from beginning that it is for > returning back the buffer to the owner. > > Only member OPs consume buffer, and group leader provides valid buffer > for member OP, and the buffer lifetime is aligned with group leader > request. > > > but it would need to wait until the userspace reacts. Yes, it > > works in the specific context of ublk, but I don't think it works > > as a generic interface. > > It is just how ublk uses group buffer, but not necessary to be exactly > this way. > > Anytime the buffer is provided via io_provide_group_kbuf() successfully, > the member OPs can consume it safely, and finally the buffer is returned > back if all member OPs are completed. That is all. Forget to mention: The same buffer can be provided multiple times if it is valid, and one offset can be added(not done yet in this patchset) easily on the provide buffer uring_command, so the buffer can be advanced in case of short recv in the provider side. > Please explain why it isn't generic interface. Thanks, Ming