Re: [PATCH V6 7/8] io_uring/uring_cmd: support provide group kernel buffer

Ming Lei <ming.lei@xxxxxxxxxx> · Tue, 15 Oct 2024 19:05:35 +0800

On Mon, Oct 14, 2024 at 07:40:40PM +0100, Pavel Begunkov wrote:
> On 10/11/24 16:45, Ming Lei wrote:
> > On Fri, Oct 11, 2024 at 08:41:03AM -0600, Jens Axboe wrote:
> > > On 10/11/24 8:20 AM, Ming Lei wrote:
> > > > On Fri, Oct 11, 2024 at 07:24:27AM -0600, Jens Axboe wrote:
> > > > > On 10/10/24 9:07 PM, Ming Lei wrote:
> > > > > > On Thu, Oct 10, 2024 at 08:39:12PM -0600, Jens Axboe wrote:
> > > > > > > On 10/10/24 8:30 PM, Ming Lei wrote:
> > > > > > > > Hi Jens,
> ...
> > > > > > Suppose we have N consumers OPs which depends on OP_BUF_UPDATE.
> > > > > > 
> > > > > > 1) all N OPs are linked with OP_BUF_UPDATE
> > > > > > 
> > > > > > Or
> > > > > > 
> > > > > > 2) submit OP_BUF_UPDATE first, and wait its completion, then submit N
> > > > > > OPs concurrently.
> > > > > 
> > > > > Correct
> > > > > 
> > > > > > But 1) and 2) may slow the IO handling.  In 1) all N OPs are serialized,
> > > > > > and 1 extra syscall is introduced in 2).
> > > > > 
> > > > > Yes you don't want to do #1. But the OP_BUF_UPDATE is cheap enough that
> > > > > you can just do it upfront. It's not ideal in terms of usage, and I get
> > > > > where the grouping comes from. But is it possible to do the grouping in
> > > > > a less intrusive fashion with OP_BUF_UPDATE? Because it won't change any
> > > > 
> > > > Most of the 'intrusive' change is just in patch 4, and Pavel has commented
> > > > that it is good enough:
> > > > 
> > > > https://lore.kernel.org/linux-block/ZwZzsPcXyazyeZnu@fedora/T/#m551e94f080b80ccbd2561e01da5ea8e17f7ee15d
> 
> Trying to catch up on the thread. I do think the patch is tolerable and
> mergeable, but I do think it adds quite a bit of complication to the path if
> you try to have a map in what state a request can be and what

I admit that sqe group adds a little complexity to the submission &
completion code, especially on the completion side.

But with your help, patch 4 has become easy to follow and sqe group
is well-defined now. It also adds the new feature of N:M dependency;
without it, one extra syscall is required to support N:M dependency.
So this approach not only saves one syscall, but also simplifies the
application.
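
To make the comparison concrete, here is a rough toy model of the three
submission strategies discussed above for one buffer-providing OP plus N
consumer OPs. It is only a sketch and not code from the patchset: the
names (estimate_cost, struct cost, the enum values) are made up for
illustration, and it assumes every OP takes one "step" and that a
submit-and-wait counts as a single syscall.

```c
#include <assert.h>

/* Hypothetical names, for illustration only. */
enum strategy {
	LINKED_CHAIN,    /* 1) all N OPs linked after the buffer OP */
	TWO_STEP_SUBMIT, /* 2) submit buffer OP, wait, then submit N OPs */
	SQE_GROUP,       /* buffer OP as group leader, N OPs as members */
};

struct cost {
	unsigned submit_syscalls;  /* submission syscalls issued */
	unsigned serialized_steps; /* length of the dependency chain */
};

/* Toy estimate: every OP costs one step, n_consumers must be > 0. */
static struct cost estimate_cost(enum strategy s, unsigned n_consumers)
{
	struct cost c = { 0, 0 };

	assert(n_consumers > 0);

	switch (s) {
	case LINKED_CHAIN:
		/* one submission, but consumers run one after another */
		c.submit_syscalls = 1;
		c.serialized_steps = 1 + n_consumers;
		break;
	case TWO_STEP_SUBMIT:
		/* consumers run concurrently, at the cost of a second submission */
		c.submit_syscalls = 2;
		c.serialized_steps = 2;
		break;
	case SQE_GROUP:
		/* one submission, consumers run concurrently after the leader */
		c.submit_syscalls = 1;
		c.serialized_steps = 2;
		break;
	}
	return c;
}
```

With N = 4, this model gives 1 syscall and 5 serialized steps for the
linked chain, 2 syscalls and 2 steps for the two-step submission, and
1 syscall and 2 steps for the group.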

> dependencies are there, and then the patches after have to go to every
> io_uring opcode and add support for leased buffers. And I'm afraid

Only fast IO (net, fs) needs it; I don't see other OPs needing such support.

> that we'll also need feedback from completion of those to let
> the buffer know what ranges now have data / initialised. One typical
> problem for page flipping rx, for example, is that you need to have
> a full page of data to map it, otherwise it should be prezeroed,
> which is too expensive, same problem you can have without mmap'ing
> and directly exposing pages to the user.