On 12/19/23 21:03, David Wei wrote:
From: Pavel Begunkov <asml.silence@xxxxxxxxx>

We're adding a new pp memory provider to implement io_uring zerocopy
receive. It'll be "registered" in pp and used in later patches.

The typical life cycle of a buffer goes as follows: first it's allocated
to a driver with the initial refcount set to 1. The driver fills it with
data, puts it into an skb and passes it down the stack, where it gets
queued up to a socket. Later, a zc io_uring request will be receiving
data from the socket from a task context. At that point io_uring will
tell the userspace that this buffer has some data by posting an
appropriate completion. It'll also elevate the refcount by
IO_ZC_RX_UREF, so the buffer is not recycled while userspace is reading
the data. When the userspace is done with the buffer, it should return
it back to io_uring by adding an entry to the buffer refill ring. When
necessary, io_uring will poll the refill ring, compare references
including IO_ZC_RX_UREF and reuse the buffer.

Initially, all buffers are placed in a spinlock protected ->freelist.
It's a slow path stash, where buffers are considered to be unallocated
and not exposed to the core page pool. On allocation, pp will first try
all its caches, and then the ->alloc_pages callback if everything else
failed.

The hot path for io_pp_zc_alloc_pages() is to grab pages from the refill
ring. The consumption from the ring is always done in the attached napi
context, so no additional synchronisation is required. If that fails,
we'll be getting buffers from the ->freelist.

Note: only buffers in ->freelist are considered unallocated by the page
pool, so we only increment pages_state_hold_cnt when allocating from
there. Subsequently, as page_pool_return_page() and others bump the
->pages_state_release_cnt counter, io_pp_zc_release_page() can only use
->freelist, which is not a problem as it's not a hot path.

Signed-off-by: Pavel Begunkov <asml.silence@xxxxxxxxx>
Signed-off-by: David Wei <dw@xxxxxxxxxxx>
---
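To make the ordering above a bit more concrete, the provider's allocation
callback boils down to roughly the following. This is a simplified sketch
only: pp->mp_priv as the way to get at the ifq and the direct cache
handling are placeholders, and details in the actual patch may differ.

static struct page *io_pp_zc_alloc_pages(struct page_pool *pp, gfp_t gfp)
{
	struct io_zc_rx_ifq *ifq = pp->mp_priv; /* placeholder for the real ifq hookup */

	/* hot path: consume buffers userspace returned via the refill ring */
	if (!pp->alloc.count)
		io_zc_rx_ring_refill(pp, ifq);

	/* slow path: fall back to the spinlock protected ->freelist */
	if (!pp->alloc.count)
		io_zc_rx_refill_slow(pp, ifq);

	if (!pp->alloc.count)
		return NULL;
	return pp->alloc.cache[--pp->alloc.count];
}

Only the ->freelist leg bumps pages_state_hold_cnt, since that's the only
point where a buffer crosses from "unallocated" into the page pool.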
...
+static void io_zc_rx_ring_refill(struct page_pool *pp,
+				 struct io_zc_rx_ifq *ifq)
+{
+	unsigned int entries = io_zc_rx_rqring_entries(ifq);
+	unsigned int mask = ifq->rq_entries - 1;
+	struct io_zc_rx_pool *pool = ifq->pool;
+
+	if (unlikely(!entries))
+		return;
+
+	while (entries--) {
+		unsigned int rq_idx = ifq->cached_rq_head++ & mask;
+		struct io_uring_rbuf_rqe *rqe = &ifq->rqes[rq_idx];
+		u32 pgid = rqe->off / PAGE_SIZE;
+		struct io_zc_rx_buf *buf = &pool->bufs[pgid];
+
+		if (!io_zc_rx_put_buf_uref(buf))
+			continue;
It's worth noting that here we have to add a dma sync, as per the
discussions with the page pool folks.
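Concretely, I'd expect something along these lines right before the buffer
is pushed into the pp cache. Just a sketch: it assumes the dma address is
stashed in the backing pp page as usual and that the pool is set up with
->dma_dir, ->offset and ->max_len.

	dma_addr_t dma = page_pool_get_dma_addr(io_zc_buf_to_pp_page(buf));

	/* make the buffer device visible again before recycling it */
	dma_sync_single_range_for_device(pp->p.dev, dma, pp->p.offset,
					 pp->p.max_len, pp->p.dma_dir);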
+		io_zc_add_pp_cache(pp, buf);
+		if (pp->alloc.count >= PP_ALLOC_CACHE_REFILL)
+			break;
+	}
+	smp_store_release(&ifq->ring->rq.head, ifq->cached_rq_head);
+}
+
+static void io_zc_rx_refill_slow(struct page_pool *pp, struct io_zc_rx_ifq *ifq)
+{
+	struct io_zc_rx_pool *pool = ifq->pool;
+
+	spin_lock_bh(&pool->freelist_lock);
+	while (pool->free_count && pp->alloc.count < PP_ALLOC_CACHE_REFILL) {
+		struct io_zc_rx_buf *buf;
+		u32 pgid;
+
+		pgid = pool->freelist[--pool->free_count];
+		buf = &pool->bufs[pgid];
+
+		io_zc_add_pp_cache(pp, buf);
+		pp->pages_state_hold_cnt++;
+		trace_page_pool_state_hold(pp, io_zc_buf_to_pp_page(buf),
+					   pp->pages_state_hold_cnt);
+	}
+	spin_unlock_bh(&pool->freelist_lock);
+}
...

-- 
Pavel Begunkov