On Sat, Aug 19, 2023 at 5:51 AM Jesper Dangaard Brouer <jbrouer@xxxxxxxxxx> wrote:
>
> On 10/08/2023 03.57, Mina Almasry wrote:
> > Overload the LSB of struct page* to indicate that it's a page_pool_iov.
> >
> > Refactor mm calls on struct page * into helpers, and add page_pool_iov
> > handling on those helpers. Modify callers of these mm APIs with calls to
> > these helpers instead.
> >
>
> I don't like of this approach.
> This is adding code to the PP (page_pool) fast-path in multiple places.
>
> I've not had time to run my usual benchmarks, which are here:
>
>  https://github.com/netoptimizer/prototype-kernel/blob/master/kernel/lib/bench_page_pool_simple.c
>
> But I'm sure it will affect performance.
>
> Regardless of performance, this approach is using ptr-LSB-bits, to hide
> that page-pointer are not really struct-pages, feels like force feeding
> a solution just to use the page_pool APIs.
>
> > In areas where struct page* is dereferenced, add a check for special
> > handling of page_pool_iov.
> >
> > The memory providers producing page_pool_iov can set the LSB on the
> > struct page* returned to the page pool.
> >
> > Note that instead of overloading the LSB of page pointers, we can
> > instead define a new union between struct page & struct page_pool_iov and
> > compact it in a new type. However, we'd need to implement the code churn
> > to modify the page_pool & drivers to use this new type. For this POC
> > that is not implemented (feedback welcome).
> >
>
> I've said before, that I prefer multiplexing on page->pp_magic.
> For your page_pool_iov the layout would have to match the offset of
> pp_magic, to do this. (And if insisting on using PP infra the refcnt
> would also need to align).

Perhaps I misunderstand, but this suggests continuing to use struct
page to demultiplex memory type?

I think the feedback has been strong not to multiplex yet another
memory type that is not a real page into that struct, which is why we
went in this direction. This latest series limits the impact largely
to networking structures and code.

One way or another, there will be a branch and multiplexing, whether
that is in struct page, the page pool, or a new netdev mem type as you
propose.

Any regression in page pool can be avoided in the common case that
does not use device mem by placing that check behind a static_branch.
Would that address your performance concerns?

>
> On the allocation side, all drivers already use a driver helper
> page_pool_dev_alloc_pages() or we could add another (better named)
> helper to multiplex between other types of allocators, e.g. a devmem
> allocator.
>
> On free/return/recycle the functions napi_pp_put_page or skb_pp_recycle
> could multiplex on pp_magic and call another API. The API could be an
> extension to PP helpers, but it could also be a devmap allocator helper.
>
> IMHO forcing/piggy-bagging everything into page_pool is not the right
> solution. I really think netstack need to support different allocator
> types.

To me this is lifting page_pool into such a netstack allocator pool.
I'm not sure adding another explicit layer of indirection would be
cleaner or faster (potentially more indirect calls).

As for the LSB trick: that avoided adding a lot of boilerplate churn
with a new type and helper functions.
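To make the static_branch point above concrete, here is a rough,
untested sketch of what I have in mind. The key name and the helper
body are illustrative only, not taken from this series:

DEFINE_STATIC_KEY_FALSE(page_pool_mem_providers);

static inline bool page_is_page_pool_iov(const struct page *page)
{
        /* LSB of the pointer marks a page_pool_iov, per the commit message */
        return static_branch_unlikely(&page_pool_mem_providers) &&
               ((unsigned long)page & 1UL);
}

The key would only be enabled (static_branch_inc()) when a memory
provider is actually installed on a pool, so in the page-only case the
check reduces to a never-taken jump label rather than a load and
compare in the fast path.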
> The page pool have been leading the way, yes, but perhaps it is
> time to add an API layer that e.g. could be named netmem, that gives us
> the multiplexing between allocators. In that process some of page_pool
> APIs would be lifted out as common blocks and others remain.
>
> --Jesper
>
> > I have a sample implementation of adding a new page_pool_token type
> > in the page_pool to give a general idea here:
> > https://github.com/torvalds/linux/commit/3a7628700eb7fd02a117db036003bca50779608d
> >
> > Full branch here:
> > https://github.com/torvalds/linux/compare/master...mina:linux:tcpdevmem-pp-tokens
> >
> > (In the branches above, page_pool_iov is called devmem_slice).
> >
> > Could also add static_branch to speed up the checks in page_pool_iov
> > memory providers are being used.
> >
> > Signed-off-by: Mina Almasry <almasrymina@xxxxxxxxxx>
> > ---
> >  include/net/page_pool.h | 74 ++++++++++++++++++++++++++++++++++-
> >  net/core/page_pool.c    | 85 ++++++++++++++++++++++++++++-------------
> >  2 files changed, 131 insertions(+), 28 deletions(-)
> >
> > diff --git a/include/net/page_pool.h b/include/net/page_pool.h
> > index 537eb36115ed..f08ca230d68e 100644
> > --- a/include/net/page_pool.h
> > +++ b/include/net/page_pool.h
> > @@ -282,6 +282,64 @@ static inline struct page_pool_iov *page_to_page_pool_iov(struct page *page)
> >          return NULL;
> >  }
> >
> > +static inline int page_pool_page_ref_count(struct page *page)
> > +{
> > +        if (page_is_page_pool_iov(page))
> > +                return page_pool_iov_refcount(page_to_page_pool_iov(page));
> > +
> > +        return page_ref_count(page);
> > +}
> > +
> > +static inline void page_pool_page_get_many(struct page *page,
> > +                                           unsigned int count)
> > +{
> > +        if (page_is_page_pool_iov(page))
> > +                return page_pool_iov_get_many(page_to_page_pool_iov(page),
> > +                                              count);
> > +
> > +        return page_ref_add(page, count);
> > +}
> > +
> > +static inline void page_pool_page_put_many(struct page *page,
> > +                                           unsigned int count)
> > +{
> > +        if (page_is_page_pool_iov(page))
> > +                return page_pool_iov_put_many(page_to_page_pool_iov(page),
> > +                                              count);
> > +
> > +        if (count > 1)
> > +                page_ref_sub(page, count - 1);
> > +
> > +        put_page(page);
> > +}
> > +
> > +static inline bool page_pool_page_is_pfmemalloc(struct page *page)
> > +{
> > +        if (page_is_page_pool_iov(page))
> > +                return false;
> > +
> > +        return page_is_pfmemalloc(page);
> > +}
> > +
> > +static inline bool page_pool_page_is_pref_nid(struct page *page, int pref_nid)
> > +{
> > +        /* Assume page_pool_iov are on the preferred node without actually
> > +         * checking...
> > +         *
> > +         * This check is only used to check for recycling memory in the page
> > +         * pool's fast paths. Currently the only implementation of page_pool_iov
> > +         * is dmabuf device memory. It's a deliberate decision by the user to
> > +         * bind a certain dmabuf to a certain netdev, and the netdev rx queue
> > +         * would not be able to reallocate memory from another dmabuf that
> > +         * exists on the preferred node, so, this check doesn't make much sense
> > +         * in this case. Assume all page_pool_iovs can be recycled for now.
> > +         */
> > +        if (page_is_page_pool_iov(page))
> > +                return true;
> > +
> > +        return page_to_nid(page) == pref_nid;
> > +}
> > +
> >  struct page_pool {
> >          struct page_pool_params p;
> >
> > @@ -434,6 +492,9 @@ static inline long page_pool_defrag_page(struct page *page, long nr)
> >  {
> >          long ret;
> >
> > +        if (page_is_page_pool_iov(page))
> > +                return -EINVAL;
> > +
> >          /* If nr == pp_frag_count then we have cleared all remaining
> >           * references to the page. No need to actually overwrite it, instead
> >           * we can leave this to be overwritten by the calling function.
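As an aside for anyone reading along: page_is_page_pool_iov() and
page_to_page_pool_iov() themselves are not visible here, only the tail
of page_to_page_pool_iov() shows up as context in the first hunk.
Given the commit message, I assume the tag/untag pair is essentially
the following; PP_IOV_TAG and page_pool_iov_to_page() are names I made
up and may not match the series:

#define PP_IOV_TAG      1UL

static inline struct page_pool_iov *page_to_page_pool_iov(struct page *page)
{
        /* strip the LSB tag to recover the real page_pool_iov pointer */
        return (struct page_pool_iov *)((unsigned long)page & ~PP_IOV_TAG);
}

static inline struct page *page_pool_iov_to_page(struct page_pool_iov *iov)
{
        /* providers hand back a tagged pointer wherever a page pointer is expected */
        return (struct page *)((unsigned long)iov | PP_IOV_TAG);
}

(The real page_to_page_pool_iov() apparently also handles the non-iov
case, judging by the return NULL visible in the first hunk.)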
> > @@ -494,7 +555,12 @@ static inline void page_pool_recycle_direct(struct page_pool *pool,
> >
> >  static inline dma_addr_t page_pool_get_dma_addr(struct page *page)
> >  {
> > -        dma_addr_t ret = page->dma_addr;
> > +        dma_addr_t ret;
> > +
> > +        if (page_is_page_pool_iov(page))
> > +                return page_pool_iov_dma_addr(page_to_page_pool_iov(page));
> > +
> > +        ret = page->dma_addr;
> >
> >          if (PAGE_POOL_DMA_USE_PP_FRAG_COUNT)
> >                  ret |= (dma_addr_t)page->dma_addr_upper << 16 << 16;
> > @@ -504,6 +570,12 @@ static inline dma_addr_t page_pool_get_dma_addr(struct page *page)
> >
> >  static inline void page_pool_set_dma_addr(struct page *page, dma_addr_t addr)
> >  {
> > +        /* page_pool_iovs are mapped and their dma-addr can't be modified. */
> > +        if (page_is_page_pool_iov(page)) {
> > +                DEBUG_NET_WARN_ON_ONCE(true);
> > +                return;
> > +        }
> > +
> >          page->dma_addr = addr;
> >          if (PAGE_POOL_DMA_USE_PP_FRAG_COUNT)
> >                  page->dma_addr_upper = upper_32_bits(addr);
> > diff --git a/net/core/page_pool.c b/net/core/page_pool.c
> > index 0a7c08d748b8..20c1f74fd844 100644
> > --- a/net/core/page_pool.c
> > +++ b/net/core/page_pool.c
> > @@ -318,7 +318,7 @@ static struct page *page_pool_refill_alloc_cache(struct page_pool *pool)
> >                  if (unlikely(!page))
> >                          break;
> >
> > -                if (likely(page_to_nid(page) == pref_nid)) {
> > +                if (likely(page_pool_page_is_pref_nid(page, pref_nid))) {
> >                          pool->alloc.cache[pool->alloc.count++] = page;
> >                  } else {
> >                          /* NUMA mismatch;
> > @@ -363,7 +363,15 @@ static void page_pool_dma_sync_for_device(struct page_pool *pool,
> >                                            struct page *page,
> >                                            unsigned int dma_sync_size)
> >  {
> > -        dma_addr_t dma_addr = page_pool_get_dma_addr(page);
> > +        dma_addr_t dma_addr;
> > +
> > +        /* page_pool_iov memory provider do not support PP_FLAG_DMA_SYNC_DEV */
> > +        if (page_is_page_pool_iov(page)) {
> > +                DEBUG_NET_WARN_ON_ONCE(true);
> > +                return;
> > +        }
> > +
> > +        dma_addr = page_pool_get_dma_addr(page);
> >
> >          dma_sync_size = min(dma_sync_size, pool->p.max_len);
> >          dma_sync_single_range_for_device(pool->p.dev, dma_addr,
> > @@ -375,6 +383,12 @@ static bool page_pool_dma_map(struct page_pool *pool, struct page *page)
> >  {
> >          dma_addr_t dma;
> >
> > +        if (page_is_page_pool_iov(page)) {
> > +                /* page_pool_iovs are already mapped */
> > +                DEBUG_NET_WARN_ON_ONCE(true);
> > +                return true;
> > +        }
> > +
> >          /* Setup DMA mapping: use 'struct page' area for storing DMA-addr
> >           * since dma_addr_t can be either 32 or 64 bits and does not always fit
> >           * into page private data (i.e 32bit cpu with 64bit DMA caps)
> > @@ -398,14 +412,24 @@ static bool page_pool_dma_map(struct page_pool *pool, struct page *page)
> >  static void page_pool_set_pp_info(struct page_pool *pool,
> >                                    struct page *page)
> >  {
> > -        page->pp = pool;
> > -        page->pp_magic |= PP_SIGNATURE;
> > +        if (!page_is_page_pool_iov(page)) {
> > +                page->pp = pool;
> > +                page->pp_magic |= PP_SIGNATURE;
> > +        } else {
> > +                page_to_page_pool_iov(page)->pp = pool;
> > +        }
> > +
> >          if (pool->p.init_callback)
> >                  pool->p.init_callback(page, pool->p.init_arg);
> >  }
> >
> >  static void page_pool_clear_pp_info(struct page *page)
> >  {
> > +        if (page_is_page_pool_iov(page)) {
> > +                page_to_page_pool_iov(page)->pp = NULL;
> > +                return;
> > +        }
> > +
> >          page->pp_magic = 0;
> >          page->pp = NULL;
> >  }
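Also, from the helpers this patch relies on (page_pool_iov_refcount(),
page_pool_iov_dma_addr(), and the ->pp assignment just above) I would
expect struct page_pool_iov to carry roughly the fields below. This is
purely my reading of the patch; the real definition is not part of
this hunk:

struct page_pool_iov {
        struct page_pool *pp;   /* owning pool, mirrors page->pp */
        refcount_t refcount;    /* backs page_pool_iov_{get,put}_many() */
        dma_addr_t dma_addr;    /* pre-mapped by the memory provider */
        /* provider-specific state, e.g. the dmabuf binding and offset */
};

If the pp_magic-based multiplexing were used instead, as you suggest,
a pp_magic field would additionally have to sit at the same offset as
in struct page.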
> > @@ -615,7 +639,7 @@ static bool page_pool_recycle_in_cache(struct page *page,
> >                  return false;
> >          }
> >
> > -        /* Caller MUST have verified/know (page_ref_count(page) == 1) */
> > +        /* Caller MUST have verified/know (page_pool_page_ref_count(page) == 1) */
> >          pool->alloc.cache[pool->alloc.count++] = page;
> >          recycle_stat_inc(pool, cached);
> >          return true;
> > @@ -638,9 +662,10 @@ __page_pool_put_page(struct page_pool *pool, struct page *page,
> >           * refcnt == 1 means page_pool owns page, and can recycle it.
> >           *
> >           * page is NOT reusable when allocated when system is under
> > -         * some pressure. (page_is_pfmemalloc)
> > +         * some pressure. (page_pool_page_is_pfmemalloc)
> >           */
> > -        if (likely(page_ref_count(page) == 1 && !page_is_pfmemalloc(page))) {
> > +        if (likely(page_pool_page_ref_count(page) == 1 &&
> > +                   !page_pool_page_is_pfmemalloc(page))) {
> >                  /* Read barrier done in page_ref_count / READ_ONCE */
> >
> >                  if (pool->p.flags & PP_FLAG_DMA_SYNC_DEV)
> > @@ -741,7 +766,8 @@ static struct page *page_pool_drain_frag(struct page_pool *pool,
> >          if (likely(page_pool_defrag_page(page, drain_count)))
> >                  return NULL;
> >
> > -        if (page_ref_count(page) == 1 && !page_is_pfmemalloc(page)) {
> > +        if (page_pool_page_ref_count(page) == 1 &&
> > +            !page_pool_page_is_pfmemalloc(page)) {
> >                  if (pool->p.flags & PP_FLAG_DMA_SYNC_DEV)
> >                          page_pool_dma_sync_for_device(pool, page, -1);
> >
> > @@ -818,9 +844,9 @@ static void page_pool_empty_ring(struct page_pool *pool)
> >          /* Empty recycle ring */
> >          while ((page = ptr_ring_consume_bh(&pool->ring))) {
> >                  /* Verify the refcnt invariant of cached pages */
> > -                if (!(page_ref_count(page) == 1))
> > +                if (!(page_pool_page_ref_count(page) == 1))
> >                          pr_crit("%s() page_pool refcnt %d violation\n",
> > -                                __func__, page_ref_count(page));
> > +                                __func__, page_pool_page_ref_count(page));
> >
> >                  page_pool_return_page(pool, page);
> >          }
> > @@ -977,19 +1003,24 @@ bool page_pool_return_skb_page(struct page *page, bool napi_safe)
> >          struct page_pool *pp;
> >          bool allow_direct;
> >
> > -        page = compound_head(page);
> > +        if (!page_is_page_pool_iov(page)) {
> > +                page = compound_head(page);
> >
> > -        /* page->pp_magic is OR'ed with PP_SIGNATURE after the allocation
> > -         * in order to preserve any existing bits, such as bit 0 for the
> > -         * head page of compound page and bit 1 for pfmemalloc page, so
> > -         * mask those bits for freeing side when doing below checking,
> > -         * and page_is_pfmemalloc() is checked in __page_pool_put_page()
> > -         * to avoid recycling the pfmemalloc page.
> > -         */
> > -        if (unlikely((page->pp_magic & ~0x3UL) != PP_SIGNATURE))
> > -                return false;
> > +                /* page->pp_magic is OR'ed with PP_SIGNATURE after the
> > +                 * allocation in order to preserve any existing bits, such as
> > +                 * bit 0 for the head page of compound page and bit 1 for
> > +                 * pfmemalloc page, so mask those bits for freeing side when
> > +                 * doing below checking, and page_pool_page_is_pfmemalloc() is
> > +                 * checked in __page_pool_put_page() to avoid recycling the
> > +                 * pfmemalloc page.
> > +                 */
> > +                if (unlikely((page->pp_magic & ~0x3UL) != PP_SIGNATURE))
> > +                        return false;
> >
> > -        pp = page->pp;
> > +                pp = page->pp;
> > +        } else {
> > +                pp = page_to_page_pool_iov(page)->pp;
> > +        }
> >
> >          /* Allow direct recycle if we have reasons to believe that we are
> >           * in the same context as the consumer would run, so there's
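This function is also where the napi_pp_put_page()/skb_pp_recycle()
multiplexing you describe would live. Under a pp_magic-based scheme
the top of it could look roughly like the following instead of the
LSB check; DEVMEM_SIGNATURE and devmem_put_iov() are made-up names,
purely to illustrate the shape:

        /* hypothetical: devmem objects carry their own signature at the
         * pp_magic offset instead of being tagged pointers
         */
        if (unlikely((page->pp_magic & ~0x3UL) == DEVMEM_SIGNATURE))
                return devmem_put_iov((struct page_pool_iov *)page, napi_safe);

        page = compound_head(page);
        if (unlikely((page->pp_magic & ~0x3UL) != PP_SIGNATURE))
                return false;

Either way the branch exists on the free path; the question is only
which layer owns it, which is the point I was trying to make above.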
> > @@ -1273,9 +1304,9 @@ static bool mp_huge_busy(struct mp_huge *hu, unsigned int idx)
> >
> >          for (j = 0; j < (1 << MP_HUGE_ORDER); j++) {
> >                  page = hu->page[idx] + j;
> > -                if (page_ref_count(page) != 1) {
> > +                if (page_pool_page_ref_count(page) != 1) {
> >                          pr_warn("Page with ref count %d at %u, %u. Can't safely destory, leaking memory!\n",
> > -                                page_ref_count(page), idx, j);
> > +                                page_pool_page_ref_count(page), idx, j);
> >                          return true;
> >                  }
> >          }
> > @@ -1330,7 +1361,7 @@ static struct page *mp_huge_alloc_pages(struct page_pool *pool, gfp_t gfp)
> >                          continue;
> >
> >                  if ((page->pp_magic & ~0x3UL) == PP_SIGNATURE ||
> > -                    page_ref_count(page) != 1) {
> > +                    page_pool_page_ref_count(page) != 1) {
> >                          atomic_inc(&mp_huge_ins_b);
> >                          continue;
> >                  }
> > @@ -1458,9 +1489,9 @@ static void mp_huge_1g_destroy(struct page_pool *pool)
> >          free = true;
> >          for (i = 0; i < MP_HUGE_1G_CNT; i++) {
> >                  page = hu->page + i;
> > -                if (page_ref_count(page) != 1) {
> > +                if (page_pool_page_ref_count(page) != 1) {
> >                          pr_warn("Page with ref count %d at %u. Can't safely destory, leaking memory!\n",
> > -                                page_ref_count(page), i);
> > +                                page_pool_page_ref_count(page), i);
> >                          free = false;
> >                          break;
> >                  }
> > @@ -1489,7 +1520,7 @@ static struct page *mp_huge_1g_alloc_pages(struct page_pool *pool, gfp_t gfp)
> >          page = hu->page + page_i;
> >
> >          if ((page->pp_magic & ~0x3UL) == PP_SIGNATURE ||
> > -            page_ref_count(page) != 1) {
> > +            page_pool_page_ref_count(page) != 1) {
> >                  atomic_inc(&mp_huge_ins_b);
> >                  continue;
> >          }
> > --
> > 2.41.0.640.ga95def55d0-goog
> >
>