Re: [PATCH V6 6/8] io_uring: support providing sqe group buffer

Ming Lei <ming.lei@xxxxxxxxxx> · Fri, 11 Oct 2024 12:06:18 +0800

On Fri, Oct 11, 2024 at 10:00:06AM +0800, Ming Lei wrote:
> On Thu, Oct 10, 2024 at 07:51:19PM +0100, Pavel Begunkov wrote:
> > On 10/10/24 04:00, Ming Lei wrote:
> > > On Wed, Oct 09, 2024 at 03:25:25PM +0100, Pavel Begunkov wrote:
> > > > On 10/6/24 09:20, Ming Lei wrote:
> > > > > On Fri, Oct 04, 2024 at 04:32:04PM +0100, Pavel Begunkov wrote:
> > > > > > On 9/12/24 11:49, Ming Lei wrote:
> > > > > > ...
> > > > > > > It can help to implement generic zero copy between device and related
> > > > > > > operations, such as ublk, fuse, vdpa,
> > > > > > > even network receive or whatever.
> > > > > > 
> > > > > > That's exactly the thing it can't sanely work with because
> > > > > > of this design.
> > > > > 
> > > > > The provide buffer design is absolutely generic, and basically
> > > > > 
> > > > > - group leader provides buffer for member OPs, and member OPs can borrow
> > > > > the buffer if leader allows by calling io_provide_group_kbuf()
> > > > > 
> > > > > - after member OPs consumes the buffer, the buffer is returned back by
> > > > > the callback implemented in group leader subsystem, so group leader can
> > > > > release related sources;
> > > > > 
> > > > > - and it is guaranteed that the buffer can be released always
> > > > > 
> > > > > The ublk implementation is pretty simple, it can be reused in device driver
> > > > > to share buffer with other kernel subsystems.
> > > > > 
> > > > > I don't see anything insane with the design.
> > > > 
> > > > There is nothing insane with the design, but the problem is cross
> > > > request error handling, same thing that makes links a pain to use.
> > > 
> > > Wrt. link, the whole group is linked in the chain, and it respects
> > > all existed link rule, care to share the trouble in link use case?
> > 
> > Error handling is a pain, it has been, even for pure link without
> > any groups. Even with a simple req1 -> req2, you need to track if
> > the first request fails you need to expect another failed CQE for
> > the second request, probably refcount (let's say non-atomically)
> > some structure and clean it up when you get both CQEs. It's not
> > prettier when the 2nd fails, especially if you consider short IO
> > and that you can't fully retry that partial IO, e.g. you consumed
> > data from the socket. And so on.
> > 
> > > The only thing I thought of is that group internal link isn't supported
> > > yet, but it may be added in future if use case is coming.
> > > 
> > > > It's good that with storage reads are reasonably idempotent and you
> > > > can be retried if needed. With sockets and streams, however, you
> > > > can't sanely borrow a buffer without consuming it, so if a member
> > > > request processing the buffer fails for any reason, the user data
> > > > will be dropped on the floor. I mentioned quite a while before,
> > > > if for example you stash the buffer somewhere you can access
> > > > across syscalls like the io_uring's registered buffer table, the
> > > > user at least would be able to find an error and then memcpy the
> > > > unprocessed data as a fallback.
> > > 
> > > I guess it is net rx case, which requires buffer to cross syscalls,
> > > then the buffer has to be owned by userspace, otherwise the buffer
> > > can be leaked easily.
> > > 
> > > That may not match with sqe group which is supposed to borrow kernel
> > > buffer consumed by users.
> > 
> > It doesn't necessarily require to keep buffers across syscalls
> > per se, it just can't drop the data it got on the floor. It's
> > just storage can read data again.
> 
> In case of short read, data is really stored(not dropped) in the provided
> buffer, and you can consume the short read data, or continue to read more to
> the same buffer.
> 
> What is the your real issue here?
> 
> > 
> > ...
> > > > > > > diff --git a/include/linux/io_uring_types.h b/include/linux/io_uring_types.h
> > > > > > > index 793d5a26d9b8..445e5507565a 100644
> > > > > > > --- a/include/linux/io_uring_types.h
> > > > > > > +++ b/include/linux/io_uring_types.h
> > ...
> > > > > > FWIW, would be nice if during init figure we can verify that the leader
> > > > > > provides a buffer IFF there is someone consuming it, but I don't think
> > > > > 
> > > > > It isn't doable, same reason with IORING_OP_PROVIDE_BUFFERS, since buffer can
> > > > > only be provided in ->issue().
> > > > 
> > > > In theory we could, in practise it'd be too much of a pain, I agree.
> > > > 
> > > > IORING_OP_PROVIDE_BUFFERS is different as you just stash the buffer
> > > > in the io_uring instance, and it's used at an unspecified time later
> > > > by some request. In this sense the API is explicit, requests that don't
> > > > support it but marked with IOSQE_BUFFER_SELECT will be failed by the
> > > > kernel.
> > > 
> > > That is also one reason why I add ->accept_group_kbuf.
> > 
> > I probably missed that, but I haven't seen that
> 
> Such as, any OPs with fixed buffer can't set ->accept_group_kbuf.
> 
> > 
> > > > > > the semantics is flexible enough to do it sanely. i.e. there are many
> > > > > > members in a group, some might want to use the buffer and some might not.
> > > > > > 
> > ...
> > > > > > > +	if (!kbuf->bvec)
> > > > > > > +		return -EINVAL;
> > > > > > 
> > > > > > How can this happen?
> > > > > 
> > > > > OK, we can run the check in uring_cmd API.
> > > > 
> > > > Not sure I follow, if a request providing a buffer can't set
> > > > a bvec it should just fail, without exposing half made
> > > > io_uring_kernel_buf to other requests.
> > > > 
> > > > Is it rather a WARN_ON_ONCE check?
> > > 
> > > I meant we can check it in API of io_provide_group_kbuf() since the group
> > > buffer is filled by driver, since then the buffer is immutable, and we
> > > needn't any other check.
> > 
> > That's be a buggy provider, so sounds like WARN_ON_ONCE
> 
> Not at all.
> 
> If the driver provides bad buffer, all group leader and members OP will be
> failed, and userspace can get notified.
> 
> > 
> > ...
> > > > > > >     		if (unlikely(ret < 0))
> > > > > > > @@ -593,6 +600,15 @@ int io_send(struct io_kiocb *req, unsigned int issue_flags)
> > > > > > >     	if (issue_flags & IO_URING_F_NONBLOCK)
> > > > > > >     		flags |= MSG_DONTWAIT;
> > > > > > > +	if (req->flags & REQ_F_GROUP_KBUF) {
> > > > > > 
> > > > > > Does anything prevent the request to be marked by both
> > > > > > GROUP_KBUF and BUFFER_SELECT? In which case we'd set up
> > > > > > a group kbuf and then go to the io_do_buffer_select()
> > > > > > overriding all of that
> > > > > 
> > > > > It could be used in this way, and we can fail the member in
> > > > > io_queue_group_members().
> > > > 
> > > > That's where the opdef flag could actually be useful,
> > > > 
> > > > if (opdef[member]->accept_group_kbuf &&
> > > >      member->flags & SELECT_BUF)
> > > > 	fail;
> > > > 
> > > > 
> > > > > > > +		ret = io_import_group_kbuf(req,
> > > > > > > +					user_ptr_to_u64(sr->buf),
> > > > > > > +					sr->len, ITER_SOURCE,
> > > > > > > +					&kmsg->msg.msg_iter);
> > > > > > > +		if (unlikely(ret))
> > > > > > > +			return ret;
> > > > > > > +	}
> > > > > > > +
> > > > > > >     retry_bundle:
> > > > > > >     	if (io_do_buffer_select(req)) {
> > > > > > >     		struct buf_sel_arg arg = {
> > > > > > > @@ -1154,6 +1170,11 @@ int io_recv(struct io_kiocb *req, unsigned int issue_flags)
> > > > > > >     			goto out_free;
> > > > > > >     		}
> > > > > > >     		sr->buf = NULL;
> > > > > > > +	} else if (req->flags & REQ_F_GROUP_KBUF) {
> > > > > > 
> > > > > > What happens if we get a short read/recv?
> > > > > 
> > > > > For short read/recv, any progress is stored in iterator, nothing to do
> > > > > with the provide buffer, which is immutable.
> > > > > 
> > > > > One problem for read is reissue, but it can be handled by saving iter
> > > > > state after the group buffer is imported, I will fix it in next version.
> > > > > For net recv, offset/len of buffer is updated in case of short recv, so
> > > > > it works as expected.
> > > > 
> > > > That was one of my worries.
> > > > 
> > > > > Or any other issue with short read/recv? Can you explain in detail?
> > > > 
> > > > To sum up design wise, when members that are using the buffer as a
> > > > source, e.g. write/send, fail, the user is expected to usually reissue
> > > > both the write and the ublk cmd.
> > > > 
> > > > Let's say you have a ublk leader command providing a 4K buffer, and
> > > > you group it with a 4K send using the buffer. Let's assume the send
> > > > is short and does't only 2K of data. Then the user would normally
> > > > reissue:
> > > > 
> > > > ublk(4K, GROUP), send(off=2K)
> > > > 
> > > > That's fine assuming short IO is rare.
> > > > 
> > > > I worry more about the backward flow, ublk provides an "empty" buffer
> > > > to receive/read into. ublk wants to do something with the buffer in
> > > > the callback. What happens when read/recv is short (and cannot be
> > > > retried by io_uring)?
> > > > 
> > > > 1. ublk(provide empty 4K buffer)
> > > > 2. recv, ret=2K
> > > > 3. ->grp_kbuf_ack: ublk should commit back only 2K
> > > >     of data and not assume it's 4K
> > > 
> > > ->grp_kbuf_ack is supposed to only return back the buffer to the
> > > owner, and it doesn't care result of buffer consumption.
> > > 
> > > When ->grp_kbuf_ack() is done, it means this time buffer borrow is
> > > over.
> > > 
> > > When userspace figures out it is one short read, it will send one
> > > ublk uring_cmd to notify that this io command is completed with
> > > result(2k). ublk driver may decide to requeue this io command for
> > > retrying the remained bytes, when only remained part of the buffer
> > > is allowed to borrow in following provide uring command originated
> > > from userspace.
> > 
> > My apologies, I failed to notice that moment, even though should've
> > given it some thinking at the very beginning. I think that part would
> > be a terrible interface. Might be good enough for ublk, but we can't
> > be creating a ublk specific features that change the entire io_uring.
> > Without knowing how much data it actually got, in generic case you
> 
> You do know how much data actually got in the member OP, don't you?
> 
> > 1) need to require the buffer to be fully initialised / zeroed
> > before handing it.
> 
> The buffer is really initialized before being provided via
> io_provide_group_kbuf(). And it is one bvec buffer, anytime the part
> is consumed, the iterator is advanced, so always initialized buffer
> is provided to consumer OP.
> 
> > 2) Can't ever commit the data from the callback,
> 
> What do you mean `commit`?
> 
> The callback is documented clearly from beginning that it is for
> returning back the buffer to the owner.
> 
> Only member OPs consume buffer, and group leader provides valid buffer
> for member OP, and the buffer lifetime is aligned with group leader
> request.
> 
> > but it would need to wait until the userspace reacts. Yes, it
> > works in the specific context of ublk, but I don't think it works
> > as a generic interface.
> 
> It is just how ublk uses group buffer, but not necessary to be exactly
> this way.
> 
> Anytime the buffer is provided via io_provide_group_kbuf() successfully,
> the member OPs can consume it safely, and finally the buffer is returned
> back if all member OPs are completed. That is all.

Forget to mention:

The same buffer can be provided multiple times if it is valid, and one
offset can be added(not done yet in this patchset) easily on the provide
buffer uring_command, so the buffer can be advanced in case of short recv
in the provider side.

> Please explain why it isn't generic interface.

Thanks,
Ming