Re: [PATCH v7 11/15] io_uring/zcrx: implement zerocopy receive pp memory provider

Mina Almasry <almasrymina@xxxxxxxxxx> · Fri, 1 Nov 2024 13:06:08 -0700

On Tue, Oct 29, 2024 at 4:06 PM David Wei <dw@xxxxxxxxxxx> wrote:
>
> From: Pavel Begunkov <asml.silence@xxxxxxxxx>
>
> Implement a page pool memory provider for io_uring to receieve in a
> zero copy fashion. For that, the provider allocates user pages wrapped
> around into struct net_iovs, that are stored in a previously registered
> struct net_iov_area.
>
> Unlike with traditional receives, for which pages from a page pool can
> be deallocated right after the user receives data, e.g. via recv(2),
> we extend the lifetime by recycling buffers only after the user space
> acknowledges that it's done processing the data via the refill queue.
> Before handing buffers to the user, we mark them by bumping the refcount
> by a bias value IO_ZC_RX_UREF, which will be checked when the buffer is
> returned back. When the corresponding io_uring instance and/or page pool
> are destroyed, we'll force back all buffers that are currently in the
> user space in ->io_pp_zc_scrub by clearing the bias.
>
> Refcounting and lifetime:
>
> Initially, all buffers are considered unallocated and stored in
> ->freelist, at which point they are not yet directly exposed to the core
> page pool code and not accounted to page pool's pages_state_hold_cnt.
> The ->alloc_netmems callback will allocate them by placing into the
> page pool's cache, setting the refcount to 1 as usual and adjusting
> pages_state_hold_cnt.
>
> Then, either the buffer is dropped and returns back to the page pool
> into the ->freelist via io_pp_zc_release_netmem, in which case the page
> pool will match hold_cnt for us with ->pages_state_release_cnt. Or more
> likely the buffer will go through the network/protocol stacks and end up
> in the corresponding socket's receive queue. From there the user can get
> it via an new io_uring request implemented in following patches. As
> mentioned above, before giving a buffer to the user we bump the refcount
> by IO_ZC_RX_UREF.
>
> Once the user is done with the buffer processing, it must return it back
> via the refill queue, from where our ->alloc_netmems implementation can
> grab it, check references, put IO_ZC_RX_UREF, and recycle the buffer if
> there are no more users left. As we place such buffers right back into
> the page pools fast cache and they didn't go through the normal pp
> release path, they are still considered "allocated" and no pp hold_cnt
> is required. For the same reason we dma sync buffers for the device
> in io_zc_add_pp_cache().
>
> Signed-off-by: Pavel Begunkov <asml.silence@xxxxxxxxx>
> Signed-off-by: David Wei <dw@xxxxxxxxxxx>
> ---
>  io_uring/zcrx.c | 215 ++++++++++++++++++++++++++++++++++++++++++++++++
>  io_uring/zcrx.h |   5 ++
>  2 files changed, 220 insertions(+)
>
> diff --git a/io_uring/zcrx.c b/io_uring/zcrx.c
> index a276572fe953..aad35676207e 100644
> --- a/io_uring/zcrx.c
> +++ b/io_uring/zcrx.c
> @@ -2,7 +2,12 @@
>  #include <linux/kernel.h>
>  #include <linux/errno.h>
>  #include <linux/mm.h>
> +#include <linux/nospec.h>
> +#include <linux/netdevice.h>
>  #include <linux/io_uring.h>
> +#include <net/page_pool/helpers.h>
> +#include <net/page_pool/memory_provider.h>
> +#include <trace/events/page_pool.h>
>
>  #include <uapi/linux/io_uring.h>
>
> @@ -14,6 +19,16 @@
>
>  #define IO_RQ_MAX_ENTRIES              32768
>
> +__maybe_unused
> +static const struct memory_provider_ops io_uring_pp_zc_ops;
> +
> +static inline struct io_zcrx_area *io_zcrx_iov_to_area(const struct net_iov *niov)
> +{
> +       struct net_iov_area *owner = net_iov_owner(niov);
> +
> +       return container_of(owner, struct io_zcrx_area, nia);
> +}
> +

We discussed this before I disappeared on vacation but I'm not too
convinced to be honest, sorry.

It's invalid to call io_zcrx_iov_to_area on a devmem niov and vice
versa, right? So current and future code has to be very careful to
call the right helpers on the right niovs.

At the very least there needs to be a comment above all these
container_of helpers:

/* caller must have verified that this niov is devmem/io_zcrx */.

However I feel like even a comment is extremely error prone. These
container_of's are inside of the call stack of some helpers. I would
say we need a check. If we're concerned about performance, the check
can be behind DEBUG_NET_WARN_ON(), although even that is a bit iffy,
but could be fine. Doing this without a check seems too risky to me.

>  static int io_allocate_rbuf_ring(struct io_zcrx_ifq *ifq,
>                                  struct io_uring_zcrx_ifq_reg *reg)
>  {
> @@ -99,6 +114,9 @@ static int io_zcrx_create_area(struct io_ring_ctx *ctx,
>                 goto err;
>
>         for (i = 0; i < nr_pages; i++) {
> +               struct net_iov *niov = &area->nia.niovs[i];
> +
> +               niov->owner = &area->nia;
>                 area->freelist[i] = i;
>         }
>
> @@ -230,3 +248,200 @@ void io_shutdown_zcrx_ifqs(struct io_ring_ctx *ctx)
>  {
>         lockdep_assert_held(&ctx->uring_lock);
>  }
> +
> +static bool io_zcrx_niov_put(struct net_iov *niov, int nr)
> +{
> +       return atomic_long_sub_and_test(nr, &niov->pp_ref_count);
> +}
> +
> +static bool io_zcrx_put_niov_uref(struct net_iov *niov)
> +{
> +       if (atomic_long_read(&niov->pp_ref_count) < IO_ZC_RX_UREF)
> +               return false;
> +
> +       return io_zcrx_niov_put(niov, IO_ZC_RX_UREF);
> +}
> +

Sorry, I have to push back a bit against this. The refcounting of
netmem is already complicated. the paged netmem has 2 refcounts and
care needs to be taken when acquiring and dropping refcounts. net_iov
inherited the pp_ref_count but not the paged refcount, and again need
some special handling. skb_frag_unref takes very special care checking
pp->recycle, is_pp_netmem, and others to figure out the correct
refcount to put based on the type of the netmem and skb flag.

This code ignores all these generic code
skb_frag_unref/napi_pp_put_page/etc paths and uses raw access to
niv->pp_ref_count. If this is merged as-is, for posterity any changes
in netmem refcounting need to also account for this use case opting
out of these generic code paths that handle all other skb reffing
including devmem.

Additionally since you're opting out of the generic unreffing paths
you're also (as mentioned before) bypassing the pp recycling. AFAICT
that may be hurting your performance. IIUC you refill
PP_ALLOC_CACHE_REFILL (64) entries everytime _alloc_netmems is
entered, and you don't recycle netmem any other way, so your slow path
is entered 1/64 of the page_pool_alloc calls? That should be much
worse than what the normal pp recycling does, which returns all freed
netmem into its alloc.cache or the ptr_ring and hits *_alloc_netmems
much more rarely. There are also regular perf improvements and testing
to the generic pool recycling paths you're also opting out of.

I see a lot of downsides to opting out of the generic use cases. Is
there any reason the normal freeing paths are not applicable to your
use case?

> +static inline void io_zc_add_pp_cache(struct page_pool *pp,
> +                                     struct net_iov *niov)
> +{
> +}
> +

Looks unused/empty.

> +static inline u32 io_zcrx_rqring_entries(struct io_zcrx_ifq *ifq)
> +{
> +       u32 entries;
> +
> +       entries = smp_load_acquire(&ifq->rq_ring->tail) - ifq->cached_rq_head;
> +       return min(entries, ifq->rq_entries);
> +}
> +
> +static struct io_uring_zcrx_rqe *io_zcrx_get_rqe(struct io_zcrx_ifq *ifq,
> +                                                unsigned mask)
> +{
> +       unsigned int idx = ifq->cached_rq_head++ & mask;
> +
> +       return &ifq->rqes[idx];
> +}
> +
> +static void io_zcrx_ring_refill(struct page_pool *pp,
> +                               struct io_zcrx_ifq *ifq)
> +{
> +       unsigned int entries = io_zcrx_rqring_entries(ifq);
> +       unsigned int mask = ifq->rq_entries - 1;
> +
> +       entries = min_t(unsigned, entries, PP_ALLOC_CACHE_REFILL - pp->alloc.count);
> +       if (unlikely(!entries))
> +               return;
> +
> +       do {
> +               struct io_uring_zcrx_rqe *rqe = io_zcrx_get_rqe(ifq, mask);
> +               struct io_zcrx_area *area;
> +               struct net_iov *niov;
> +               unsigned niov_idx, area_idx;
> +
> +               area_idx = rqe->off >> IORING_ZCRX_AREA_SHIFT;
> +               niov_idx = (rqe->off & ~IORING_ZCRX_AREA_MASK) / PAGE_SIZE;
> +
> +               if (unlikely(rqe->__pad || area_idx))
> +                       continue;
> +               area = ifq->area;
> +
> +               if (unlikely(niov_idx >= area->nia.num_niovs))
> +                       continue;
> +               niov_idx = array_index_nospec(niov_idx, area->nia.num_niovs);
> +
> +               niov = &area->nia.niovs[niov_idx];
> +               if (!io_zcrx_put_niov_uref(niov))
> +                       continue;
> +               page_pool_mp_return_in_cache(pp, net_iov_to_netmem(niov));

I'm sorry I pushed back against the provider filling the pp cache in
earlier iterations. That was very wrong.

*_alloc_netmems was meant to be a mirror of
__page_pool_alloc_pages_slow. Since the latter fills the pp->cache,
the former should also fill the pp->cache. The dmabuf mp is actually
deficient in this regard. I'll look into fixing it. This part is more
than fine for me.

I'm still unsure about opting out of the generic freeing paths as I
mentioned above though.

> +       } while (--entries);
> +
> +       smp_store_release(&ifq->rq_ring->head, ifq->cached_rq_head);
> +}
> +
> +static void io_zcrx_refill_slow(struct page_pool *pp, struct io_zcrx_ifq *ifq)
> +{
> +       struct io_zcrx_area *area = ifq->area;
> +
> +       spin_lock_bh(&area->freelist_lock);
> +       while (area->free_count && pp->alloc.count < PP_ALLOC_CACHE_REFILL) {
> +               struct net_iov *niov;
> +               u32 pgid;
> +
> +               pgid = area->freelist[--area->free_count];
> +               niov = &area->nia.niovs[pgid];
> +
> +               page_pool_mp_return_in_cache(pp, net_iov_to_netmem(niov));
> +
> +               pp->pages_state_hold_cnt++;
> +               trace_page_pool_state_hold(pp, net_iov_to_netmem(niov),
> +                                          pp->pages_state_hold_cnt);
> +       }
> +       spin_unlock_bh(&area->freelist_lock);
> +}
> +
> +static void io_zcrx_recycle_niov(struct net_iov *niov)
> +{
> +       struct io_zcrx_area *area = io_zcrx_iov_to_area(niov);
> +
> +       spin_lock_bh(&area->freelist_lock);
> +       area->freelist[area->free_count++] = net_iov_idx(niov);
> +       spin_unlock_bh(&area->freelist_lock);
> +}
> +
> +static netmem_ref io_pp_zc_alloc_netmems(struct page_pool *pp, gfp_t gfp)
> +{
> +       struct io_zcrx_ifq *ifq = pp->mp_priv;
> +
> +       /* pp should already be ensuring that */
> +       if (unlikely(pp->alloc.count))
> +               goto out_return;
> +
> +       io_zcrx_ring_refill(pp, ifq);
> +       if (likely(pp->alloc.count))
> +               goto out_return;
> +
> +       io_zcrx_refill_slow(pp, ifq);
> +       if (!pp->alloc.count)
> +               return 0;
> +out_return:
> +       return pp->alloc.cache[--pp->alloc.count];
> +}
> +
> +static bool io_pp_zc_release_netmem(struct page_pool *pp, netmem_ref netmem)
> +{
> +       struct net_iov *niov;
> +
> +       if (WARN_ON_ONCE(!netmem_is_net_iov(netmem)))
> +               return false;
> +
> +       niov = netmem_to_net_iov(netmem);
> +
> +       if (io_zcrx_niov_put(niov, 1))
> +               io_zcrx_recycle_niov(niov);
> +       return false;
> +}
> +
> +static void io_pp_zc_scrub(struct page_pool *pp)
> +{
> +       struct io_zcrx_ifq *ifq = pp->mp_priv;
> +       struct io_zcrx_area *area = ifq->area;
> +       int i;
> +
> +       /* Reclaim back all buffers given to the user space. */
> +       for (i = 0; i < area->nia.num_niovs; i++) {
> +               struct net_iov *niov = &area->nia.niovs[i];
> +               int count;
> +
> +               if (!io_zcrx_put_niov_uref(niov))
> +                       continue;
> +               io_zcrx_recycle_niov(niov);
> +
> +               count = atomic_inc_return_relaxed(&pp->pages_state_release_cnt);
> +               trace_page_pool_state_release(pp, net_iov_to_netmem(niov), count);
> +       }
> +}
> +
> +static int io_pp_zc_init(struct page_pool *pp)
> +{
> +       struct io_zcrx_ifq *ifq = pp->mp_priv;
> +       struct io_zcrx_area *area = ifq->area;
> +       int ret;
> +
> +       if (!ifq)
> +               return -EINVAL;
> +       if (pp->p.order != 0)
> +               return -EINVAL;
> +       if (!pp->p.napi)
> +               return -EINVAL;
> +
> +       ret = page_pool_mp_init_paged_area(pp, &area->nia, area->pages);
> +       if (ret)
> +               return ret;
> +
> +       percpu_ref_get(&ifq->ctx->refs);
> +       ifq->pp = pp;
> +       return 0;
> +}
> +
> +static void io_pp_zc_destroy(struct page_pool *pp)
> +{
> +       struct io_zcrx_ifq *ifq = pp->mp_priv;
> +       struct io_zcrx_area *area = ifq->area;
> +
> +       page_pool_mp_release_area(pp, &ifq->area->nia);
> +
> +       ifq->pp = NULL;
> +
> +       if (WARN_ON_ONCE(area->free_count != area->nia.num_niovs))
> +               return;
> +       percpu_ref_put(&ifq->ctx->refs);
> +}
> +
> +static const struct memory_provider_ops io_uring_pp_zc_ops = {
> +       .alloc_netmems          = io_pp_zc_alloc_netmems,
> +       .release_netmem         = io_pp_zc_release_netmem,
> +       .init                   = io_pp_zc_init,
> +       .destroy                = io_pp_zc_destroy,
> +       .scrub                  = io_pp_zc_scrub,
> +};
> diff --git a/io_uring/zcrx.h b/io_uring/zcrx.h
> index a8db61498c67..464b4bd89b64 100644
> --- a/io_uring/zcrx.h
> +++ b/io_uring/zcrx.h
> @@ -5,6 +5,9 @@
>  #include <linux/io_uring_types.h>
>  #include <net/page_pool/types.h>
>
> +#define IO_ZC_RX_UREF                  0x10000
> +#define IO_ZC_RX_KREF_MASK             (IO_ZC_RX_UREF - 1)
> +
>  struct io_zcrx_area {
>         struct net_iov_area     nia;
>         struct io_zcrx_ifq      *ifq;
> @@ -22,10 +25,12 @@ struct io_zcrx_ifq {
>         struct io_ring_ctx              *ctx;
>         struct net_device               *dev;
>         struct io_zcrx_area             *area;
> +       struct page_pool                *pp;
>
>         struct io_uring                 *rq_ring;
>         struct io_uring_zcrx_rqe        *rqes;
>         u32                             rq_entries;
> +       u32                             cached_rq_head;
>
>         unsigned short                  n_rqe_pages;
>         struct page                     **rqe_pages;
> --
> 2.43.5
>

--
Thanks,
Mina