On Tue, Aug 22, 2023 at 5:24 AM Jesper Dangaard Brouer <jbrouer@xxxxxxxxxx> wrote: > > > > On 22/08/2023 08.05, Mina Almasry wrote: > > On Sat, Aug 19, 2023 at 2:51 AM Jesper Dangaard Brouer > > <jbrouer@xxxxxxxxxx> wrote: > >> > >> On 10/08/2023 03.57, Mina Almasry wrote: > >>> Overload the LSB of struct page* to indicate that it's a page_pool_iov. > >>> > >>> Refactor mm calls on struct page * into helpers, and add page_pool_iov > >>> handling on those helpers. Modify callers of these mm APIs with calls to > >>> these helpers instead. > >>> > >> > >> I don't like of this approach. > >> This is adding code to the PP (page_pool) fast-path in multiple places. > >> > >> I've not had time to run my usual benchmarks, which are here: > >> > >> https://github.com/netoptimizer/prototype-kernel/blob/master/kernel/lib/bench_page_pool_simple.c > >> > > > > I ported over this benchmark to my tree and ran it, my results: > > > > What CPU is this and GHz? (I guess 2.6 GHz based on results). > > (It looks like this CPU is more efficient, instructions per cycles, than > my E5-1650 v4 @ 3.60GHz). > cat /proc/cpuinfo ... vendor_id : GenuineIntel cpu family : 6 model : 143 model name : Intel(R) Xeon(R) Platinum 8481C CPU @ 2.70GHz stepping : 8 microcode : 0xffffffff cpu MHz : 2699.998 ``` This is a vCPU on the Google Cloud A3 VMs. > > net-next @ b44693495af8 > > https://pastebin.com/raw/JuU7UQXe > > > > + Jakub's memory-provider APIs: > > https://pastebin.com/raw/StMBhetn > > > > + devmem TCP changes: > > https://pastebin.com/raw/mY1L6U4r > > > > Only a single cycle slowdown for "page_pool01_fast_path". > From 10 cycles to 11 cycles. > > > + intentional regression just to make sure the benchmark is working: > > https://pastebin.com/raw/wqWhcJdG > > > > I don't seem to be able to detect a regression with this series as-is, > > but I'm not that familiar with the test and may be doing something > > wrong or misinterpreting the results. Does this look ok to you? > > > > The performance results are better than I expected. The small > regression from 10 cycles to 11 cycles is actually 10%, but I expect > with some likely/unlikely instrumentation we can "likely" remove this again. > So the patch is already optimized carefully (I hope) to put all the devmem processing in the default unlikely path. Willem showed me that: if (page_pool_iov()) return handle_page_pool_iov(); return handle_page(); The handle_page() will be 'likely' by default, which removes the need for explicit likely/unlikely. I'm not sure we can get better perf with explicit likely/unlikey, but I can try. > So, this change actually looks acceptable from a performance PoV. > I still think this page_pool_iov is very invasive to page_pool, but > maybe it is better to hide this "uglyness" inside page_pool. > > The test primarily tests fast-path, and you also add "if" statements to > all the DMA operations, which is not part of this benchmark. Perhaps we > can add unlikely statements, or inspect (objdump) the ASM to check code > priorities the original page based "provider". > > >> But I'm sure it will affect performance. > >> > > Guess, I was wrong ;-) > > --Jesper > > > >> Regardless of performance, this approach is using ptr-LSB-bits, to hide > >> that page-pointer are not really struct-pages, feels like force feeding > >> a solution just to use the page_pool APIs. > >> > >> > >>> In areas where struct page* is dereferenced, add a check for special > >>> handling of page_pool_iov. > >>> > >>> The memory providers producing page_pool_iov can set the LSB on the > >>> struct page* returned to the page pool. > >>> > >>> Note that instead of overloading the LSB of page pointers, we can > >>> instead define a new union between struct page & struct page_pool_iov and > >>> compact it in a new type. However, we'd need to implement the code churn > >>> to modify the page_pool & drivers to use this new type. For this POC > >>> that is not implemented (feedback welcome). > >>> > >> > >> I've said before, that I prefer multiplexing on page->pp_magic. > >> For your page_pool_iov the layout would have to match the offset of > >> pp_magic, to do this. (And if insisting on using PP infra the refcnt > >> would also need to align). > >> > >> On the allocation side, all drivers already use a driver helper > >> page_pool_dev_alloc_pages() or we could add another (better named) > >> helper to multiplex between other types of allocators, e.g. a devmem > >> allocator. > >> > >> On free/return/recycle the functions napi_pp_put_page or skb_pp_recycle > >> could multiplex on pp_magic and call another API. The API could be an > >> extension to PP helpers, but it could also be a devmap allocator helper. > >> > >> IMHO forcing/piggy-bagging everything into page_pool is not the right > >> solution. I really think netstack need to support different allocator > >> types. The page pool have been leading the way, yes, but perhaps it is > >> time to add an API layer that e.g. could be named netmem, that gives us > >> the multiplexing between allocators. In that process some of page_pool > >> APIs would be lifted out as common blocks and others remain. > >> > >> --Jesper > >> > >>> I have a sample implementation of adding a new page_pool_token type > >>> in the page_pool to give a general idea here: > >>> https://github.com/torvalds/linux/commit/3a7628700eb7fd02a117db036003bca50779608d > >>> > >>> Full branch here: > >>> https://github.com/torvalds/linux/compare/master...mina:linux:tcpdevmem-pp-tokens > >>> > >>> (In the branches above, page_pool_iov is called devmem_slice). > >>> > >>> Could also add static_branch to speed up the checks in page_pool_iov > >>> memory providers are being used. > >>> > >>> Signed-off-by: Mina Almasry <almasrymina@xxxxxxxxxx> > >>> --- > >>> include/net/page_pool.h | 74 ++++++++++++++++++++++++++++++++++- > >>> net/core/page_pool.c | 85 ++++++++++++++++++++++++++++------------- > >>> 2 files changed, 131 insertions(+), 28 deletions(-) > >>> > >>> diff --git a/include/net/page_pool.h b/include/net/page_pool.h > >>> index 537eb36115ed..f08ca230d68e 100644 > >>> --- a/include/net/page_pool.h > >>> +++ b/include/net/page_pool.h > >>> @@ -282,6 +282,64 @@ static inline struct page_pool_iov *page_to_page_pool_iov(struct page *page) > >>> return NULL; > >>> } > >>> > >>> +static inline int page_pool_page_ref_count(struct page *page) > >>> +{ > >>> + if (page_is_page_pool_iov(page)) > >>> + return page_pool_iov_refcount(page_to_page_pool_iov(page)); > >>> + > >>> + return page_ref_count(page); > >>> +} > >>> + > >>> +static inline void page_pool_page_get_many(struct page *page, > >>> + unsigned int count) > >>> +{ > >>> + if (page_is_page_pool_iov(page)) > >>> + return page_pool_iov_get_many(page_to_page_pool_iov(page), > >>> + count); > >>> + > >>> + return page_ref_add(page, count); > >>> +} > >>> + > >>> +static inline void page_pool_page_put_many(struct page *page, > >>> + unsigned int count) > >>> +{ > >>> + if (page_is_page_pool_iov(page)) > >>> + return page_pool_iov_put_many(page_to_page_pool_iov(page), > >>> + count); > >>> + > >>> + if (count > 1) > >>> + page_ref_sub(page, count - 1); > >>> + > >>> + put_page(page); > >>> +} > >>> + > >>> +static inline bool page_pool_page_is_pfmemalloc(struct page *page) > >>> +{ > >>> + if (page_is_page_pool_iov(page)) > >>> + return false; > >>> + > >>> + return page_is_pfmemalloc(page); > >>> +} > >>> + > >>> +static inline bool page_pool_page_is_pref_nid(struct page *page, int pref_nid) > >>> +{ > >>> + /* Assume page_pool_iov are on the preferred node without actually > >>> + * checking... > >>> + * > >>> + * This check is only used to check for recycling memory in the page > >>> + * pool's fast paths. Currently the only implementation of page_pool_iov > >>> + * is dmabuf device memory. It's a deliberate decision by the user to > >>> + * bind a certain dmabuf to a certain netdev, and the netdev rx queue > >>> + * would not be able to reallocate memory from another dmabuf that > >>> + * exists on the preferred node, so, this check doesn't make much sense > >>> + * in this case. Assume all page_pool_iovs can be recycled for now. > >>> + */ > >>> + if (page_is_page_pool_iov(page)) > >>> + return true; > >>> + > >>> + return page_to_nid(page) == pref_nid; > >>> +} > >>> + > >>> struct page_pool { > >>> struct page_pool_params p; > >>> > >>> @@ -434,6 +492,9 @@ static inline long page_pool_defrag_page(struct page *page, long nr) > >>> { > >>> long ret; > >>> > >>> + if (page_is_page_pool_iov(page)) > >>> + return -EINVAL; > >>> + > >>> /* If nr == pp_frag_count then we have cleared all remaining > >>> * references to the page. No need to actually overwrite it, instead > >>> * we can leave this to be overwritten by the calling function. > >>> @@ -494,7 +555,12 @@ static inline void page_pool_recycle_direct(struct page_pool *pool, > >>> > >>> static inline dma_addr_t page_pool_get_dma_addr(struct page *page) > >>> { > >>> - dma_addr_t ret = page->dma_addr; > >>> + dma_addr_t ret; > >>> + > >>> + if (page_is_page_pool_iov(page)) > >>> + return page_pool_iov_dma_addr(page_to_page_pool_iov(page)); > >>> + > >>> + ret = page->dma_addr; > >>> > >>> if (PAGE_POOL_DMA_USE_PP_FRAG_COUNT) > >>> ret |= (dma_addr_t)page->dma_addr_upper << 16 << 16; > >>> @@ -504,6 +570,12 @@ static inline dma_addr_t page_pool_get_dma_addr(struct page *page) > >>> > >>> static inline void page_pool_set_dma_addr(struct page *page, dma_addr_t addr) > >>> { > >>> + /* page_pool_iovs are mapped and their dma-addr can't be modified. */ > >>> + if (page_is_page_pool_iov(page)) { > >>> + DEBUG_NET_WARN_ON_ONCE(true); > >>> + return; > >>> + } > >>> + > >>> page->dma_addr = addr; > >>> if (PAGE_POOL_DMA_USE_PP_FRAG_COUNT) > >>> page->dma_addr_upper = upper_32_bits(addr); > >>> diff --git a/net/core/page_pool.c b/net/core/page_pool.c > >>> index 0a7c08d748b8..20c1f74fd844 100644 > >>> --- a/net/core/page_pool.c > >>> +++ b/net/core/page_pool.c > >>> @@ -318,7 +318,7 @@ static struct page *page_pool_refill_alloc_cache(struct page_pool *pool) > >>> if (unlikely(!page)) > >>> break; > >>> > >>> - if (likely(page_to_nid(page) == pref_nid)) { > >>> + if (likely(page_pool_page_is_pref_nid(page, pref_nid))) { > >>> pool->alloc.cache[pool->alloc.count++] = page; > >>> } else { > >>> /* NUMA mismatch; > >>> @@ -363,7 +363,15 @@ static void page_pool_dma_sync_for_device(struct page_pool *pool, > >>> struct page *page, > >>> unsigned int dma_sync_size) > >>> { > >>> - dma_addr_t dma_addr = page_pool_get_dma_addr(page); > >>> + dma_addr_t dma_addr; > >>> + > >>> + /* page_pool_iov memory provider do not support PP_FLAG_DMA_SYNC_DEV */ > >>> + if (page_is_page_pool_iov(page)) { > >>> + DEBUG_NET_WARN_ON_ONCE(true); > >>> + return; > >>> + } > >>> + > >>> + dma_addr = page_pool_get_dma_addr(page); > >>> > >>> dma_sync_size = min(dma_sync_size, pool->p.max_len); > >>> dma_sync_single_range_for_device(pool->p.dev, dma_addr, > >>> @@ -375,6 +383,12 @@ static bool page_pool_dma_map(struct page_pool *pool, struct page *page) > >>> { > >>> dma_addr_t dma; > >>> > >>> + if (page_is_page_pool_iov(page)) { > >>> + /* page_pool_iovs are already mapped */ > >>> + DEBUG_NET_WARN_ON_ONCE(true); > >>> + return true; > >>> + } > >>> + > >>> /* Setup DMA mapping: use 'struct page' area for storing DMA-addr > >>> * since dma_addr_t can be either 32 or 64 bits and does not always fit > >>> * into page private data (i.e 32bit cpu with 64bit DMA caps) > >>> @@ -398,14 +412,24 @@ static bool page_pool_dma_map(struct page_pool *pool, struct page *page) > >>> static void page_pool_set_pp_info(struct page_pool *pool, > >>> struct page *page) > >>> { > >>> - page->pp = pool; > >>> - page->pp_magic |= PP_SIGNATURE; > >>> + if (!page_is_page_pool_iov(page)) { > >>> + page->pp = pool; > >>> + page->pp_magic |= PP_SIGNATURE; > >>> + } else { > >>> + page_to_page_pool_iov(page)->pp = pool; > >>> + } > >>> + > >>> if (pool->p.init_callback) > >>> pool->p.init_callback(page, pool->p.init_arg); > >>> } > >>> > >>> static void page_pool_clear_pp_info(struct page *page) > >>> { > >>> + if (page_is_page_pool_iov(page)) { > >>> + page_to_page_pool_iov(page)->pp = NULL; > >>> + return; > >>> + } > >>> + > >>> page->pp_magic = 0; > >>> page->pp = NULL; > >>> } > >>> @@ -615,7 +639,7 @@ static bool page_pool_recycle_in_cache(struct page *page, > >>> return false; > >>> } > >>> > >>> - /* Caller MUST have verified/know (page_ref_count(page) == 1) */ > >>> + /* Caller MUST have verified/know (page_pool_page_ref_count(page) == 1) */ > >>> pool->alloc.cache[pool->alloc.count++] = page; > >>> recycle_stat_inc(pool, cached); > >>> return true; > >>> @@ -638,9 +662,10 @@ __page_pool_put_page(struct page_pool *pool, struct page *page, > >>> * refcnt == 1 means page_pool owns page, and can recycle it. > >>> * > >>> * page is NOT reusable when allocated when system is under > >>> - * some pressure. (page_is_pfmemalloc) > >>> + * some pressure. (page_pool_page_is_pfmemalloc) > >>> */ > >>> - if (likely(page_ref_count(page) == 1 && !page_is_pfmemalloc(page))) { > >>> + if (likely(page_pool_page_ref_count(page) == 1 && > >>> + !page_pool_page_is_pfmemalloc(page))) { > >>> /* Read barrier done in page_ref_count / READ_ONCE */ > >>> > >>> if (pool->p.flags & PP_FLAG_DMA_SYNC_DEV) > >>> @@ -741,7 +766,8 @@ static struct page *page_pool_drain_frag(struct page_pool *pool, > >>> if (likely(page_pool_defrag_page(page, drain_count))) > >>> return NULL; > >>> > >>> - if (page_ref_count(page) == 1 && !page_is_pfmemalloc(page)) { > >>> + if (page_pool_page_ref_count(page) == 1 && > >>> + !page_pool_page_is_pfmemalloc(page)) { > >>> if (pool->p.flags & PP_FLAG_DMA_SYNC_DEV) > >>> page_pool_dma_sync_for_device(pool, page, -1); > >>> > >>> @@ -818,9 +844,9 @@ static void page_pool_empty_ring(struct page_pool *pool) > >>> /* Empty recycle ring */ > >>> while ((page = ptr_ring_consume_bh(&pool->ring))) { > >>> /* Verify the refcnt invariant of cached pages */ > >>> - if (!(page_ref_count(page) == 1)) > >>> + if (!(page_pool_page_ref_count(page) == 1)) > >>> pr_crit("%s() page_pool refcnt %d violation\n", > >>> - __func__, page_ref_count(page)); > >>> + __func__, page_pool_page_ref_count(page)); > >>> > >>> page_pool_return_page(pool, page); > >>> } > >>> @@ -977,19 +1003,24 @@ bool page_pool_return_skb_page(struct page *page, bool napi_safe) > >>> struct page_pool *pp; > >>> bool allow_direct; > >>> > >>> - page = compound_head(page); > >>> + if (!page_is_page_pool_iov(page)) { > >>> + page = compound_head(page); > >>> > >>> - /* page->pp_magic is OR'ed with PP_SIGNATURE after the allocation > >>> - * in order to preserve any existing bits, such as bit 0 for the > >>> - * head page of compound page and bit 1 for pfmemalloc page, so > >>> - * mask those bits for freeing side when doing below checking, > >>> - * and page_is_pfmemalloc() is checked in __page_pool_put_page() > >>> - * to avoid recycling the pfmemalloc page. > >>> - */ > >>> - if (unlikely((page->pp_magic & ~0x3UL) != PP_SIGNATURE)) > >>> - return false; > >>> + /* page->pp_magic is OR'ed with PP_SIGNATURE after the > >>> + * allocation in order to preserve any existing bits, such as > >>> + * bit 0 for the head page of compound page and bit 1 for > >>> + * pfmemalloc page, so mask those bits for freeing side when > >>> + * doing below checking, and page_pool_page_is_pfmemalloc() is > >>> + * checked in __page_pool_put_page() to avoid recycling the > >>> + * pfmemalloc page. > >>> + */ > >>> + if (unlikely((page->pp_magic & ~0x3UL) != PP_SIGNATURE)) > >>> + return false; > >>> > >>> - pp = page->pp; > >>> + pp = page->pp; > >>> + } else { > >>> + pp = page_to_page_pool_iov(page)->pp; > >>> + } > >>> > >>> /* Allow direct recycle if we have reasons to believe that we are > >>> * in the same context as the consumer would run, so there's > >>> @@ -1273,9 +1304,9 @@ static bool mp_huge_busy(struct mp_huge *hu, unsigned int idx) > >>> > >>> for (j = 0; j < (1 << MP_HUGE_ORDER); j++) { > >>> page = hu->page[idx] + j; > >>> - if (page_ref_count(page) != 1) { > >>> + if (page_pool_page_ref_count(page) != 1) { > >>> pr_warn("Page with ref count %d at %u, %u. Can't safely destory, leaking memory!\n", > >>> - page_ref_count(page), idx, j); > >>> + page_pool_page_ref_count(page), idx, j); > >>> return true; > >>> } > >>> } > >>> @@ -1330,7 +1361,7 @@ static struct page *mp_huge_alloc_pages(struct page_pool *pool, gfp_t gfp) > >>> continue; > >>> > >>> if ((page->pp_magic & ~0x3UL) == PP_SIGNATURE || > >>> - page_ref_count(page) != 1) { > >>> + page_pool_page_ref_count(page) != 1) { > >>> atomic_inc(&mp_huge_ins_b); > >>> continue; > >>> } > >>> @@ -1458,9 +1489,9 @@ static void mp_huge_1g_destroy(struct page_pool *pool) > >>> free = true; > >>> for (i = 0; i < MP_HUGE_1G_CNT; i++) { > >>> page = hu->page + i; > >>> - if (page_ref_count(page) != 1) { > >>> + if (page_pool_page_ref_count(page) != 1) { > >>> pr_warn("Page with ref count %d at %u. Can't safely destory, leaking memory!\n", > >>> - page_ref_count(page), i); > >>> + page_pool_page_ref_count(page), i); > >>> free = false; > >>> break; > >>> } > >>> @@ -1489,7 +1520,7 @@ static struct page *mp_huge_1g_alloc_pages(struct page_pool *pool, gfp_t gfp) > >>> page = hu->page + page_i; > >>> > >>> if ((page->pp_magic & ~0x3UL) == PP_SIGNATURE || > >>> - page_ref_count(page) != 1) { > >>> + page_pool_page_ref_count(page) != 1) { > >>> atomic_inc(&mp_huge_ins_b); > >>> continue; > >>> } > >>> -- > >>> 2.41.0.640.ga95def55d0-goog > >>> > >> > > > > > > -- > > Thanks, > > Mina > > > -- Thanks, Mina