Re: [PATCH v1 11/15] io_uring/zcrx: implement zerocopy receive pp memory provider

On 10/10/24 21:53, Mina Almasry wrote:
> On Thu, Oct 10, 2024 at 1:26 PM Pavel Begunkov <asml.silence@xxxxxxxxx> wrote:
...

> Sorry I wasn't clear. By 'this' I'm referring to:
>
> "from where our ->alloc_netmems implementation can grab it, check
> references, put IO_ZC_RX_UREF, and recycle the buffer if there are no
> more users left"
>
> This is the part that I'm not able to stomach at the moment. Maybe if
> I look deeper it would make more sense, but my first feeling is that
> it's really not acceptable.
>
>>> alloc_netmems (and more generically page_pool_alloc_netmem), just
>>> allocates a netmem and gives it to the page_pool code to decide

>> That's how it works because that's how devmem needs it and you
>> tailored it, not the other way around. It could've pretty well
>> been a callback that fills the cache as an intermediate, from
>> where the page pool can grab netmems and return them back to the
>> user, and it would've been a pretty clean interface as well.


> It could have been, but that would be a much worse design IMO. The
> whole point of memory providers is that they provide memory to the
> page_pool and the page_pool does its thing (among which is recycling)
> with that memory. In this patch you seem to have implemented a
> provider where, if the page is returned by io_uring, it's not
> returned to the page_pool but directly to the provider. In other
> code paths the memory will be returned to the page_pool.
>
> I.e. allocation is always:
> provider -> pp -> driver
>
> freeing from io_uring is:
> io_uring -> provider -> pp
>
> freeing from the tcp stack or driver, I'm guessing, will be:
> tcp stack/driver -> pp -> provider
>
> I'm recommending that the model for memory providers must be in line
> with what we do for pages, devmem TCP, and Jakub's out-of-tree huge
> page provider (i.e. everything else using the page_pool). The model is
> the streamlined:

Let's not go into normal pages, because 1) it can't work any other
way in the general case, it has to cross contexts from wherever the
page is freed to the napi / page pool, and 2) because devmem TCP and
io_uring already deviate from the standard page pool by extending the
lifetime of buffers to user space and more.

And then that's exactly what I'm saying: you recommend it be aligned
with devmem TCP. And let's not forget that you had to add batching to
that exact syscall return path because of performance...
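
For concreteness, the batched return in question looks roughly like the
sketch below from userspace: a single setsockopt() hands back a whole
array of tokens. This is a best-effort reconstruction of the devmem TCP
uAPI (SO_DEVMEM_DONTNEED taking an array of struct dmabuf_token), not an
authoritative reference; see the uapi headers for the real definitions.

/* Illustrative only: batch-return received devmem frags to the kernel
 * in one syscall, which is the amortisation referred to above. The
 * struct mirrors the devmem uAPI; in real code the token values come
 * from SCM_DEVMEM_DMABUF cmsgs delivered by recvmsg().
 */
#include <stdint.h>
#include <sys/socket.h>

#ifndef SO_DEVMEM_DONTNEED
#define SO_DEVMEM_DONTNEED 80	/* value as in asm-generic/socket.h */
#endif

struct dmabuf_token {		/* as in include/uapi/linux/uio.h */
	uint32_t token_start;
	uint32_t token_count;
};

/* Return a whole batch of previously received frags at once. */
static int devmem_return_frags(int sock, struct dmabuf_token *tok,
			       unsigned int nr)
{
	return setsockopt(sock, SOL_SOCKET, SO_DEVMEM_DONTNEED,
			  tok, nr * sizeof(*tok));
}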

...
> I doubt this is true or at least there needs to be more info here. The

If you don't believe me, then, please, go ahead and do your own testing,
or look through patches addressing it across the stack like [1];
you'll be able to find many more. I don't have any recent numbers on
indirect calls, but I did a fair share of testing before for different
kinds of overhead, and it has always been expensive: it can easily be
1-2% per fast block request, which could be much worse if it's per
page.

[1] https://lore.kernel.org/netdev/cover.1543836966.git.pabeni@xxxxxxxxxx/
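
To make the cost concrete, [1] mitigates it by comparing the function
pointer against its most likely target so the hot path takes a direct,
predictable call. A standalone sketch of that pattern follows; the macro
is a reduced copy of INDIRECT_CALL_1() from
include/linux/indirect_call_wrapper.h and the callback name is made up.

/* With retpolines enabled, an indirect call through a function pointer
 * is far more expensive than a predicted direct call, so hot paths
 * guess the most likely target and call it directly when the guess is
 * right.
 */
#include <stdio.h>

#define likely(x)	__builtin_expect(!!(x), 1)

#define INDIRECT_CALL_1(f, f1, ...) \
	(likely((f) == (f1)) ? (f1)(__VA_ARGS__) : (f)(__VA_ARGS__))

/* Made-up callback standing in for e.g. a GRO completion handler. */
static int rx_complete_v4(int budget)
{
	return budget - 1;
}

static int run_rx(int (*cb)(int), int budget)
{
	/* Direct call when cb is the expected handler, indirect otherwise. */
	return INDIRECT_CALL_1(cb, rx_complete_v4, budget);
}

int main(void)
{
	printf("%d\n", run_rx(rx_complete_v4, 64));
	return 0;
}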


> page_pool_alloc_netmem() pretty much allocates 1 buffer per callback
> for all its current users (regular memory & dmabuf), and that's good
> enough to drive 200gbps NICs. What is special about the io_uring use
> case that this is not good enough?
>
> The reason it is good enough in my experience is that
> page_pool_alloc_netmem() is a slow path. netmems are allocated from
> that function and heavily recycled by the page_pool afterwards.

That's because of how you return buffers back to the page pool: with
io_uring it is a hot path, even though it's amortised, exactly because
it doesn't just return one buffer at a time.
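
As a sketch only, the amortisation means the provider's ->alloc_netmems()
refills the pool from a whole batch of user-returned entries per call,
rather than taking one buffer per call. Every zcrx_*/ifq identifier below
is hypothetical rather than taken from this series, and the page_pool
fields are approximate.

/* Hypothetical sketch, not the patch: drain the ring of buffers that
 * userspace has returned and refill the page_pool's alloc cache, so a
 * single call into the provider amortises over many buffers.
 */
static netmem_ref zcrx_alloc_netmems(struct page_pool *pp, gfp_t gfp)
{
	struct zcrx_ifq *ifq = pp->mp_priv;	/* provider private data */
	unsigned int i, n;

	/* How many entries has userspace pushed into the refill ring? */
	n = zcrx_refill_ring_entries(ifq, PP_ALLOC_CACHE_REFILL);

	for (i = 0; i < n; i++) {
		struct net_iov *niov = zcrx_refill_ring_get(ifq, i);

		/* Drop the userspace reference; only buffers with no
		 * users left get recycled back into the pool. */
		if (!zcrx_put_user_ref(niov))
			continue;

		pp->alloc.cache[pp->alloc.count++] = net_iov_to_netmem(niov);
	}
	zcrx_refill_ring_advance(ifq, n);

	/* Hand one buffer to the page_pool; the rest stay in its cache. */
	return pp->alloc.count ? pp->alloc.cache[--pp->alloc.count] : 0;
}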


> Right, I guess I understand now. You need to implement your own
> recycling in the provider because your model has bypassed the
> page_pool recycling - which to me is 90% of the utility of the

So the utility of the page pool is a fast return path for the
standard page mode, i.e. napi_pp_put_page, which it is, and it is
important, I agree. But then, even though we have an IMO better
approach for this "buffer lifetime extended to userspace" scenario,
it has to use that very same return path because...?
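
(For anyone skimming the thread: the recycling both of us keep referring
to boils down to "on free, if the buffer still belongs to the pool, push
it into a small cache and hand it out again on the next alloc instead of
going back to the allocator". A toy, self-contained model of that idea,
nothing kernel-specific:)

/* Toy model of page_pool-style recycling, purely illustrative. */
#include <stdlib.h>

#define CACHE_SIZE 64

struct toy_pool {
	void *cache[CACHE_SIZE];
	unsigned int count;
};

static void *toy_pool_alloc(struct toy_pool *pool, size_t size)
{
	/* Fast path: reuse a recycled buffer if one is cached. */
	if (pool->count)
		return pool->cache[--pool->count];
	/* Slow path: fall back to the real allocator. */
	return malloc(size);
}

static void toy_pool_put(struct toy_pool *pool, void *buf)
{
	/* Fast return path: recycle instead of freeing while there is room. */
	if (pool->count < CACHE_SIZE) {
		pool->cache[pool->count++] = buf;
		return;
	}
	free(buf);
}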

> page_pool. To make matters worse, the bypass is only there if the
> netmems are returned from io_uring, and not bypassed when the netmems
> are returned from the driver/tcp stack. I'm guessing that if you
> reused the page_pool recycling in the io_uring return path then it
> would remove the need for your provider to implement its own
> recycling for the io_uring return case.
>
> Is letting providers bypass and override the page_pool's recycling in
> some code paths OK? IMO, no. A maintainer will make the judgement call

Mina, frankly, that's nonsense. If we extend the same logic, devmem
overrides page allocation rules with callbacks, devmem overrides and
violates page pool buffer lifetimes by extending them to user space,
and devmem violates and overrides the page pool object lifetime by
binding buffers to sockets. And yet I'd rather describe all of that
as extending and enhancing the page pool to fit the devmem use case.

> and speak authoritatively here and I will follow, but I do think it's
> a (much) worse design.

Sure, I hold the completely opposite opinion, that it's a much better
approach than returning through a syscall, but I will agree with you
that ultimately the maintainers will say whether it's acceptable for
networking or not.

--
Pavel Begunkov



