Re: [PATCH V6 7/8] io_uring/uring_cmd: support provide group kernel buffer

Pavel Begunkov <asml.silence@xxxxxxxxx> · Mon, 14 Oct 2024 19:40:40 +0100

On 10/11/24 16:45, Ming Lei wrote:
On Fri, Oct 11, 2024 at 08:41:03AM -0600, Jens Axboe wrote:
On 10/11/24 8:20 AM, Ming Lei wrote:
On Fri, Oct 11, 2024 at 07:24:27AM -0600, Jens Axboe wrote:
On 10/10/24 9:07 PM, Ming Lei wrote:
On Thu, Oct 10, 2024 at 08:39:12PM -0600, Jens Axboe wrote:
On 10/10/24 8:30 PM, Ming Lei wrote:
Hi Jens,
...
Suppose we have N consumers OPs which depends on OP_BUF_UPDATE.

1) all N OPs are linked with OP_BUF_UPDATE

Or

2) submit OP_BUF_UPDATE first, and wait its completion, then submit N
OPs concurrently.

Correct

But 1) and 2) may slow the IO handing.  In 1) all N OPs are serialized,
and 1 extra syscall is introduced in 2).

Yes you don't want do do #1. But the OP_BUF_UPDATE is cheap enough that
you can just do it upfront. It's not ideal in terms of usage, and I get
where the grouping comes from. But is it possible to do the grouping in
a less intrusive fashion with OP_BUF_UPDATE? Because it won't change any

The most of 'intrusive' change is just on patch 4, and Pavel has commented
that it is good enough:

https://lore.kernel.org/linux-block/ZwZzsPcXyazyeZnu@fedora/T/#m551e94f080b80ccbd2561e01da5ea8e17f7ee15d

Trying to catch up on the thread. I do think the patch is tolerable and
mergeable, but I do it adds quite a bit of complication to the path if
you try to have a map in what state a request can be and what
dependencies are there, and then patches after has to go to every each
io_uring opcode and add support for leased buffers. And I'm afraid
that we'll also need to feedback from completion of those to let
the buffer know what ranges now has data / initialised. One typical
problem for page flipping rx, for example, is that you need to have
a full page of data to map it, otherwise it should be prezeroed,
which is too expensive, same problem you can have without mmap'ing
and directly exposing pages to the user.

At least for me, patch 4 looks fine. The problem occurs when you start
needing to support this different buffer type, which is in patch 6. I'm
not saying we can necessarily solve this with OP_BUF_UPDATE, I just want
to explore that path because if we can, then patch 6 turns into "oh
let's just added registered/fixed buffer support to these ops that don't
currently support it". And that would be much nicer indeed.
...
would be totally fine in terms of performance. OP_BUF_UPDATE will
_always_ completely immediately and inline, which means that it'll
_always_ be immediately available post submission. The only think you'd
ever have to worry about in terms of failure is a badly formed request,
which is a programming issue, or running out of memory on the host.

Also it makes error handling more complicated, io_uring has to remove
the kernel buffer when the current task is exit, dependency or order with
buffer provider is introduced.

Why would that be? They belong to the ring, so should be torn down as
part of the ring anyway? Why would they be task-private, but not
ring-private?

It is kernel buffer, which belongs to provider(such as ublk) instead
of uring, application may panic any time, then io_uring has to remove
the buffer for notifying the buffer owner.

But it could be an application buffer, no? You'd just need the
application to provide it to ublk and have it mapped, rather than have
ublk allocate it in-kernel and then use that.

The buffer is actually kernel 'request/bio' pages of /dev/ublkbN, and now we
forward and borrow it to io_uring OPs(fs rw, net send/recv), so it can't be
application buffer, not same with net rx.

I don't see any problem in dropping buffers from the table
on exit, we have a lot of stuff a thread does for io_uring
when it exits.

In concept grouping is simpler because:

- buffer lifetime is aligned with group leader lifetime, so we needn't
worry buffer leak because of application accidental exit

But if it was an application buffer, that would not be a concern.

Yeah, but storage isn't same with network, here application buffer can't
support zc.

Maybe I missed how it came to app buffers, but the thing I
initially mentioned is about storing the kernel buffer in
the table, without any user pointers and user buffers.

- the buffer is borrowed to consumer OPs, and returned back after all
consumers are done, this way avoids any dependency

Meantime OP_BUF_UPDATE(provide buffer OP, remove buffer OP) becomes more
complicated:

- buffer leak because of app panic

Then io_uring dies and releases buffers. Or we can even add
some code removing it, as mentioned, any task that has ever
submitted a request already runs some io_uring code on exit.

- buffer dependency issue: consumer OPs depend on provide buffer OP,
	remove buffer OP depends on consumer OPs; two syscalls has to be
	added for handling single ublk IO.

Seems like most of this is because of the kernel buffer too, no?

Yeah.

I do like the concept of the ephemeral buffer, the downside is that we
need per-op support for it too. And while I'm not totally against doing

Can you explain per-op support a bit?

Now the buffer has been provided by one single uring command.

that, it would be lovely if we could utilize and existing mechanism for
that rather than add another one.

That would also be more flexible as not everything can be
handled by linked request logic, and wouldn't require hacking
into every each request type to support "consuming" leased
buffers.

Overhead wise, let's say we fix buffer binding order and delay it
as elaborated on below, then you can provide a buffer and link a
consumer (e.g. send request or anything else) just as you do
it now. You can also link a request returning the buffer to the
same chain if you don't need extra flexibility.

As for groups, they're complicated because of the order inversion,
the notion of a leader and so. If we get rid of the need to impose
more semantics onto it by mediating buffer transition through the
table, I think we can do groups if needed but make it simpler.

What's preventing it from registering it in ->prep()? It would be a bit
odd, but there would be nothing preventing it codewise, outside of the
oddity of ->prep() not being idempotent at that point. Don't follow why
that would be necessary, though, can you expand?

->prep() doesn't export to uring cmd, and we may not want to bother
drivers.

Also remove buffer still can't be done in ->prep().

Not dig into further, one big thing could be that dependency isn't
respected in ->prep().

And we can just fix that and move the choosing of a buffer
to ->issue(), in which case a buffer provided by one request
will be observable to its linked requests.

--
Pavel Begunkov