Re: [PATCH net-next v8 11/17] io_uring/zcrx: implement zerocopy receive pp memory provider

On 12/11/24 00:24, Jakub Kicinski wrote:
> On Tue, 10 Dec 2024 04:45:23 +0000 Pavel Begunkov wrote:
>>> Can you say more about the IO_ZC_RX_UREF bias? net_iov is not the page
>>> struct, we can add more fields. In fact we have 8B of padding in it
>>> that can be allocated without growing the struct. So why play with

>> I guess we can, though it's growing it for everyone, not just
>> io_uring, considering how indexing works, i.e. no embedding into
>> a larger struct.

> Right, but we literally have 8B of "padding". We only need a 32b counter
> here, so there will still be 4B of unused space. Not to mention that
> net_iov is not cacheline aligned today. Space is not a concern.

>>> biases? You can add a 32b atomic counter for how many refs have been
>>> handed out to the user.

>> This set does it in a stupid way, but the bias allows coalescing
>> operations with it into a single atomic. Regardless, it can be
>> placed separately, though we still need a good way to optimise
>> the counting. Take a look at my reply with questions in the v7
>> thread; I outlined what can work quite well in terms of performance
>> but needs a clear API for that from net/.

> I was thinking along the lines of transferring the ownership of
> the frags. But let's work on that as a follow up. Atomic add on

That's fine to leave out for now and deal with later, but what's
important to me, while the project goes through its preliminary
shittification, is to have a way to optimise it afterwards, and a
clear understanding that it can't be left without that optimisation
and that there are no strong opinions that would block it.

The current cache situation is rather unfortunate, understandably so
given that net_iov is aliased to struct page: pp_ref_count sits in
the same cacheline as ->pp and other fields. A net_iov typically gets
modified by napi, then refcounted from a syscall, after which a
deferred skb put drops the reference back in napi context, and some
time later, once the line has been washed out of the cache, the user
finally returns the buffer to the page pool.

> an exclusively owned cacheline is 2 cycles on AMD if I'm looking
> correctly.

Sounds too good to be true, considering that x86 implies a full
barrier for atomics. I wonder where the data comes from?

--
Pavel Begunkov
