Re: [PATCH 1/3] io_uring: Add REQ_F_CQE_SKIP support for io_uring zerocopy

On 4/11/24 01:52, Oliver Crumrine wrote:
Pavel Begunkov wrote:
On 4/9/24 02:33, Oliver Crumrine wrote:
Pavel Begunkov wrote:
On 4/7/24 20:14, Oliver Crumrine wrote:
Oliver Crumrine wrote:
Pavel Begunkov wrote:
On 4/5/24 21:04, Oliver Crumrine wrote:
Pavel Begunkov wrote:
On 4/4/24 23:17, Oliver Crumrine wrote:
In his patch to enable zerocopy networking for io_uring, Pavel Begunkov
specifically disabled REQ_F_CQE_SKIP, as (at least from my
understanding) the userspace program wouldn't receive the
IORING_CQE_F_MORE flag in the result value.

No. IORING_CQE_F_MORE means there will be another CQE from this
request, so a single CQE without IORING_CQE_F_MORE is trivially
fine.
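
For reference, roughly how userspace handles send_zc completions today with liburing (rough, untested sketch):

#include <liburing.h>

/* Rough, untested sketch: one send_zc currently produces two CQEs.
 * The first carries the result and IORING_CQE_F_MORE; the second is
 * the notification, marked with IORING_CQE_F_NOTIF.
 */
static void reap_one(struct io_uring *ring)
{
        struct io_uring_cqe *cqe;

        if (io_uring_wait_cqe(ring, &cqe))
                return;

        if (cqe->flags & IORING_CQE_F_NOTIF) {
                /* notification: the buffer can be reused now */
        } else {
                int res = cqe->res;     /* bytes sent or -errno */

                if (!(cqe->flags & IORING_CQE_F_MORE)) {
                        /* no notification will follow for this request */
                }
                (void)res;
        }
        io_uring_cqe_seen(ring, cqe);
}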

The problem is the semantics, because by suppressing the first
CQE you're losing the result value. You might rely on WAITALL
That's already happening with io_send.

Right, and it's still annoying and hard to use
Another solution might be something where there is a counter that stores
how many CQEs with REQ_F_CQE_SKIP have been processed. Before exiting,
userspace could call a function like io_wait_completions(int completions),
which would wait until everything is done, and then userspace could peek
the completion ring.
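
Roughly the flow I have in mind, as a sketch (io_wait_completions() is made up, and send_zc doesn't accept CQE_SKIP yet; the other calls are plain liburing):

#include <liburing.h>

/* Hypothetical flow, only to illustrate: io_wait_completions() does not
 * exist, and send_zc currently rejects IOSQE_CQE_SKIP_SUCCESS (that's
 * what this patch is about).
 */
static void send_batch(struct io_uring *ring, int sockfd,
                       void **bufs, size_t *lens, unsigned nr)
{
        for (unsigned i = 0; i < nr; i++) {
                struct io_uring_sqe *sqe = io_uring_get_sqe(ring);

                io_uring_prep_send_zc(sqe, sockfd, bufs[i], lens[i], 0, 0);
                sqe->flags |= IOSQE_CQE_SKIP_SUCCESS;
        }
        io_uring_submit(ring);

        io_wait_completions(ring, nr);  /* hypothetical */
        /* only failed sends (and notifs, if any) are left to peek */
}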

as other sends and "fail" (in terms of io_uring) the request
in case of a partial send posting 2 CQEs, but that's not a great
way, and it gets complicated for userspace pretty quickly.

In short, it was left out for later because there is a
better way to implement it, but it should be done carefully
Maybe we could put the return values in the notifs? That would be a
discrepancy between io_send and io_send_zc, though.

Yes. And yes, having a custom flavour is not good. It'd only
be well usable if, apart from returning the actual result,
it also guarantees there will be one and only one CQE; then
userspace doesn't have to do the dance of counting
and checking F_MORE. In fact, I outlined before what a generic
solution might look like:

https://github.com/axboe/liburing/issues/824

The only interesting part, IMHO, is to be able to merge the
main completion with its notification. Below is an old stash
rebased onto for-6.10. The only thing missing is relinking,
but maybe we don't even care about it. I need to cover it
well with tests.
The patch looks pretty good. The only potential issue is that you store
the res of the normal CQE into the notif CQE. This overwrites the
IORING_CQE_F_NOTIF with IORING_CQE_F_MORE. This means that the notif would
indicate to userspace that there will be another CQE, when there
won't be one.
I was wrong here; I mixed up the flags and the result value.

Right, it's fine. And it's synchronised by the ubuf refcounting,
though it might get more complicated if I try out some counting
optimisations.

FWIW, it shouldn't give any performance wins. The heavy stuff is
notifications waking the task, which is still there. I can even
imagine that having separate CQEs might be more flexible and would
allow more efficient CQ batching.
I've actually been working on this issue for a little while now. My current
idea is that an id is put into the optval section of the SQE, and then it
can be used to tag that req with a certain group. When a req has a flag
set on it, it can request that all of the group's notifs be "flushed" in one
notif that encompasses the entire group. If the id is zero, the req won't be
associated with a group and will generate its own notif. LMK if you see anything
in here that could overcomplicate userspace. I think it's pretty simple,
but you've had a crack at this before, so I'd like to hear your opinion.
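
For concreteness, roughly the userspace side I'm picturing (the group id placement and the flush flag are invented names, just to show the shape):

#include <liburing.h>

/* Purely hypothetical uapi sketch: IORING_SEND_ZC_FLUSH_GROUP and the
 * use of sqe->optval as a group id are made up, just to show the flow.
 */
static void send_group(struct io_uring *ring, int sockfd,
                       void **bufs, size_t *lens, unsigned nr)
{
        struct io_uring_sqe *sqe = NULL;

        for (unsigned i = 0; i < nr; i++) {
                sqe = io_uring_get_sqe(ring);
                io_uring_prep_send_zc(sqe, sockfd, bufs[i], lens[i], 0, 0);
                sqe->optval = 1;        /* hypothetical: notif group id 1 */
        }
        /* last req in the batch asks for one combined notif for the group */
        if (sqe)
                sqe->ioprio |= IORING_SEND_ZC_FLUSH_GROUP;      /* hypothetical */
        io_uring_submit(ring);
}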

You can take a look at early versions of the IORING_OP_SEND_ZC, e.g.
patchset v1, probably even later ones. It was basically doing what
you've described with minor uapi changes, like you had to declare groups
(slots) in advance, i.e. register them.
My idea is that instead of allocating slots before making requests, "slots"
will be allocated as the group ids show up. Instead of an array of slots, a
linked list can be used so things can be kmalloc'ed on the fly to make
the uapi simpler.
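
Kernel side, the bookkeeping could be as small as something like this (names made up, untested):

#include <linux/types.h>
#include <linux/list.h>
#include <linux/slab.h>
#include <linux/atomic.h>

/* One node per group id, allocated on first use instead of
 * pre-registered slots.
 */
struct zc_notif_group {
        struct list_head        list;
        u32                     id;
        atomic_t                nr_pending;     /* reqs not yet completed */
};

static struct zc_notif_group *zc_group_find_or_alloc(struct list_head *groups,
                                                     u32 id)
{
        struct zc_notif_group *g;

        list_for_each_entry(g, groups, list)
                if (g->id == id)
                        return g;

        g = kmalloc(sizeof(*g), GFP_KERNEL);
        if (!g)
                return NULL;
        g->id = id;
        atomic_set(&g->nr_pending, 0);
        list_add(&g->list, groups);
        return g;
}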

More flexible and so more performant in some circumstances, but the overall
feedback from people trying it is that it's complicated. The user should
allocate group ids, track bound requests / buffers, and do other management.
The next question is how the user should decide what to bind to what. There
is some nastiness in using the same group for multiple sockets, and then
Then maybe we find a way to prevent multiple sockets on one group.

You don't have to explicitly prevent it unless there are other reasons;
it's just not a given that a real app would be able to use it this way.

what's the cut line to flush the previous notif? I probably forgot a
I'd make it the max of a u32 -- I'm (probably) going to use an atomic_t
to store the counter of how many reqs have been completed, so a u32 max
would make sense.

To be clear, the question raised is entirely for userspace to decide,
if we're talking about the design where the user has to flush a group
notification via a flag or so. Atomics or not is a performance matter,
and that's separate.

couple more complaints.

TL;DR:

The performance is a bit of a longer story, problems are mostly coming
from the async nature of io_uring, and it'd be nice to solve at least a
part of it generically, not only for sendzc. The expensive stuff is
waking up the task, it's not unique to notifications, recv will trigger
it with polling as well as other opcodes. Then the key is completion
batching.
Maybe the interface could be made for sendzc first, and people could test it
there. Then, if it's considered beneficial for other places, it could be
implemented there as well.

What's interesting: take, for example, some tx-only toy benchmark with
DEFER_TASKRUN (recommended to use in any case). If you always wait for
sends without notifications and add an eventual *_get_events(), that would
completely avoid the wake-up overhead if there are enough buffers,
and if there aren't, it can 1:1 replace tx polling.
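
Something along these lines with liburing (rough, untested sketch):

#include <liburing.h>

/* Assumes the ring was created with
 * IORING_SETUP_SINGLE_ISSUER | IORING_SETUP_DEFER_TASKRUN.
 */
static void tx_batch(struct io_uring *ring, int sock,
                     struct iovec *iov, unsigned nr)
{
        for (unsigned i = 0; i < nr; i++) {
                struct io_uring_sqe *sqe = io_uring_get_sqe(ring);

                io_uring_prep_send_zc(sqe, sock, iov[i].iov_base,
                                      iov[i].iov_len, 0, 0);
        }
        /* wait for the send completions themselves */
        io_uring_submit_and_wait(ring, nr);
        /* reap deferred task work, e.g. pending notification CQEs */
        io_uring_get_events(ring);
}
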
Seems like an interesting way to eliminate waiting overhead.

Try groups, see if numbers are good. And a heads up, I'm looking at
I will. Working hard to have the code done by Sunday.

Good, and here is the patchset I mentioned:

https://lore.kernel.org/io-uring/cover.1712923998.git.asml.silence@xxxxxxxxx/T/

improving it a little bit for TCP because of a report; not changing the
uapi, but it might change the performance math.

--
Pavel Begunkov



