It can become quite complicated if we add feedback from completion.

Good catch on short read/recv, which may leak kernel data. The
problem exists with any other approach (provide kbuf) too; the point
is that it is a kernel buffer. What do you think of the following
approach?

diff --git a/io_uring/kbuf.h b/io_uring/kbuf.h
index d72a6bbbbd12..c1bc4179b390 100644
--- a/io_uring/kbuf.h
+++ b/io_uring/kbuf.h
@@ -242,4 +242,14 @@ static inline void io_drop_leased_grp_kbuf(struct io_kiocb *req)
 	if (gbuf)
 		gbuf->grp_kbuf_ack(gbuf);
 }
+
+/* zero the remaining bytes of the kernel buffer to avoid leaking data */
+static inline void io_req_zero_remained(struct io_kiocb *req, struct iov_iter *iter)
+{
+	size_t left = iov_iter_count(iter);
+
+	printk("iter type %d, left %zu\n", iov_iter_rw(iter), left);
+	if (iov_iter_rw(iter) == READ && left > 0)
+		iov_iter_zero(left, iter);
+}
 #endif
diff --git a/io_uring/net.c b/io_uring/net.c
index 6c32be92646f..022d81b6fc65 100644
--- a/io_uring/net.c
+++ b/io_uring/net.c
@@ -899,6 +899,8 @@ static inline bool io_recv_finish(struct io_kiocb *req, int *ret,
 		*ret = IOU_STOP_MULTISHOT;
 	else
 		*ret = IOU_OK;
+	if (io_use_leased_grp_kbuf(req))
+		io_req_zero_remained(req, &kmsg->msg.msg_iter);
 	io_req_msg_cleanup(req, issue_flags);
 	return true;
 }
diff --git a/io_uring/rw.c b/io_uring/rw.c
index 76a443fa593c..565b0e742ee5 100644
--- a/io_uring/rw.c
+++ b/io_uring/rw.c
@@ -479,6 +479,11 @@ static bool __io_complete_rw_common(struct io_kiocb *req, long res)
 		}
 		req_set_fail(req);
 		req->cqe.res = res;
+		if (io_use_leased_grp_kbuf(req)) {
+			struct io_async_rw *io = req->async_data;
+
+			io_req_zero_remained(req, &io->iter);
+		}
 	}
 	return false;
 }

> 
> > > At least for me, patch 4 looks fine. The problem occurs when you start
> > > needing to support this different buffer type, which is in patch 6. I'm
> > > not saying we can necessarily solve this with OP_BUF_UPDATE, I just want
> > > to explore that path because if we can, then patch 6 turns into "oh
> > > let's just added registered/fixed buffer support to these ops that don't
> > > currently support it". And that would be much nicer indeed.
> ...
> > > > > would be totally fine in terms of performance. OP_BUF_UPDATE will
> > > > > _always_ completely immediately and inline, which means that it'll
> > > > > _always_ be immediately available post submission. The only think you'd
> > > > > ever have to worry about in terms of failure is a badly formed request,
> > > > > which is a programming issue, or running out of memory on the host.
> > > > > 
> > > > > > Also it makes error handling more complicated, io_uring has to remove
> > > > > > the kernel buffer when the current task is exit, dependency or order with
> > > > > > buffer provider is introduced.
> > > > > 
> > > > > Why would that be? They belong to the ring, so should be torn down as
> > > > > part of the ring anyway? Why would they be task-private, but not
> > > > > ring-private?
> > > > 
> > > > It is kernel buffer, which belongs to provider(such as ublk) instead
> > > > of uring, application may panic any time, then io_uring has to remove
> > > > the buffer for notifying the buffer owner.
> > > 
> > > But it could be an application buffer, no? You'd just need the
> > > application to provide it to ublk and have it mapped, rather than have
> > > ublk allocate it in-kernel and then use that.
> > 
> > The buffer is actually kernel 'request/bio' pages of /dev/ublkbN, and now we
> > forward and borrow it to io_uring OPs(fs rw, net send/recv), so it can't be
> > application buffer, not same with net rx.
> 
> I don't see any problem in dropping buffers from the table
> on exit, we have a lot of stuff a thread does for io_uring
> when it exits.

io_uring cancel handling is complicated enough already; uring
command would have two cancel code paths if provide-kernel-buffer is
added:

1) io_uring_try_cancel_uring_cmd()

2) the kernel buffer cancel code path

There might be dependencies between the two.

> 
> 
> > > > In concept grouping is simpler because:
> > > > 
> > > > - buffer lifetime is aligned with group leader lifetime, so we needn't
> > > > worry buffer leak because of application accidental exit
> > > 
> > > But if it was an application buffer, that would not be a concern.
> > 
> > Yeah, but storage isn't same with network, here application buffer can't
> > support zc.
> 
> Maybe I missed how it came to app buffers, but the thing I
> initially mentioned is about storing the kernel buffer in
> the table, without any user pointers and user buffers.

Yeah, those were just some random words, please ignore them.

> 
> > > > - the buffer is borrowed to consumer OPs, and returned back after all
> > > > consumers are done, this way avoids any dependency
> > > > 
> > > > Meantime OP_BUF_UPDATE(provide buffer OP, remove buffer OP) becomes more
> > > > complicated:
> > > > 
> > > > - buffer leak because of app panic
> 
> Then io_uring dies and releases buffers. Or we can even add
> some code removing it, as mentioned, any task that has ever
> submitted a request already runs some io_uring code on exit.
> 
> > > > - buffer dependency issue: consumer OPs depend on provide buffer OP,
> > > > 	remove buffer OP depends on consumer OPs; two syscalls has to be
> > > > 	added for handling single ublk IO.
> > > 
> > > Seems like most of this is because of the kernel buffer too, no?
> > 
> > Yeah.
> > 
> > > 
> > > I do like the concept of the ephemeral buffer, the downside is that we
> > > need per-op support for it too. And while I'm not totally against doing
> > 
> > Can you explain per-op support a bit?
> > 
> > Now the buffer has been provided by one single uring command.
> > 
> > > that, it would be lovely if we could utilize and existing mechanism for
> > > that rather than add another one.
> 
> That would also be more flexible as not everything can be
> handled by linked request logic, and wouldn't require hacking
> into every each request type to support "consuming" leased
> buffers.

I guess by 'consuming' you mean the code added in net.c and rw.c, which
can't be avoided because it is a kernel buffer and we are supporting
that for the first time:

- there isn't a userspace address, unlike buffer select & fixed buffers
- the kernel buffer has to be returned to the provider
- the buffer has to be imported in ->issue(), and can't be done in ->prep()
- short read/recv has to be dealt with

> 
> Overhead wise, let's say we fix buffer binding order and delay it
> as elaborated on below, then you can provide a buffer and link a
> consumer (e.g. send request or anything else) just as you do
> it now. You can also link a request returning the buffer to the
> same chain if you don't need extra flexibility.
> 
> As for groups, they're complicated because of the order inversion,

IMO, the group complication only exists on the completion side;
fortunately that is well defined now.

Provide buffer and table make the application more complicated, with
worse performance:

- two syscalls (io_uring_enter() trips) are added for each ublk IO
- one extra request is added (a group needs 2 requests, while provide
  buffer needs 3 requests in the simplest case), hence bigger SQ & CQ
  sizes
- extra cancel handling

Grouping simplifies buffer lifetime a lot, since io_uring needn't
care about it at all.

> the notion of a leader and so. If we get rid of the need to impose
> more semantics onto it by mediating buffer transition through the
> table, I think we can do groups if needed but make it simpler.

The situation is just that the driver leases the buffer to io_uring; it
doesn't have to transfer it to io_uring. But once the buffer is added
to the table, it also has to be removed from the table.

It is just like a local variable vs. a global variable: the latter is
more complicated to use.

> 
> > > What's preventing it from registering it in ->prep()? It would be a bit
> > > odd, but there would be nothing preventing it codewise, outside of the
> > > oddity of ->prep() not being idempotent at that point. Don't follow why
> > > that would be necessary, though, can you expand?
> > 
> > ->prep() doesn't export to uring cmd, and we may not want to bother
> > drivers.
> > 
> > Also remove buffer still can't be done in ->prep().
> > 
> > Not dig into further, one big thing could be that dependency isn't
> > respected in ->prep().
> 
> And we can just fix that and move the choosing of a buffer
> to ->issue(), in which case a buffer provided by one request
> will be observable to its linked requests.

This patch does import the buffer in ->issue(), as I explained to Jens:

- either all OPs are linked together with add_kbuf & remove_kbuf, in
  which case the OPs can't be issued concurrently

- or two syscalls are added for handling a single ublk IO

Neither is great from a performance viewpoint, and both complicate the
application.

I don't think the above two can be avoided, or can you explain how to
do it?


thanks,
Ming




