This is a follow-up on https://lore.kernel.org/all/20250213093925.x_ggH1aj@xxxxxxxxxxxxx/ to convert the page_pool statistics to u64_stats_t in order to avoid u64-related problems on 32bit architectures.

While looking over it, I noticed that the comment for recycle_stat_inc() says that it is safe to use in preemptible context. On 32bit the u64 update is split into two 32bit writes. This is "okay" if the value is observed from the current CPU, but cross-CPU reads may observe an inconsistency if the lower part overflows while the upper part has not yet been written. I explained this and added x86-32 assembly in https://lore.kernel.org/all/20250226102703.3F7Aa2oK@xxxxxxxxxxxxx/

I don't know if it is ensured that only *one* update can happen at a time, given that the stats are per-CPU and per NAPI device. But there will now be a warning on 32bit if an update is really attempted in preemptible context.

The placement of the counters is not affected by this change, except on 32bit where an additional sync member is added. On 64bit the pahole output changes from

| struct page_pool_recycle_stats {
|         u64                     cached;                /*     0     8 */
|         u64                     cache_full;            /*     8     8 */
|         u64                     ring;                  /*    16     8 */
|         u64                     ring_full;             /*    24     8 */
|         u64                     released_refcnt;       /*    32     8 */
|
|         /* size: 40, cachelines: 1, members: 5 */
|         /* last cacheline: 40 bytes */
| };

to

| struct page_pool_recycle_stats {
|         struct u64_stats_sync   syncp;                 /*     0     0 */
|         u64_stats_t             cached;                /*     0     8 */
|         u64_stats_t             cache_full;            /*     8     8 */
|         u64_stats_t             ring;                  /*    16     8 */
|         u64_stats_t             ring_full;             /*    24     8 */
|         u64_stats_t             released_refcnt;       /*    32     8 */
|
|         /* size: 40, cachelines: 1, members: 6 */
|         /* last cacheline: 40 bytes */
| };

On 32bit struct u64_stats_sync grows by 4 bytes (plus an additional 20 with lockdep).
For bench_page_pool_simple.ko with loops=600000000 I ended up with, before:

| time_bench: Type:for_loop Per elem: 1 cycles(tsc) 0.501 ns (step:0)
| time_bench: Type:atomic_inc Per elem: 6 cycles(tsc) 3.303 ns (step:0)
| time_bench: Type:lock Per elem: 28 cycles(tsc) 14.038 ns (step:0)
| time_bench: Type:u64_stats_inc Per elem: 1 cycles(tsc) 0.565 ns (step:0)
| time_bench: Type:this_cpu_inc Per elem: 1 cycles(tsc) 0.503 ns (step:0)
|
| bench_page_pool_simple: time_bench_page_pool01_fast_path(): Cannot use page_pool fast-path
| time_bench: Type:no-softirq-page_pool01 Per elem: 19 cycles(tsc) 9.526 ns (step:0)
| bench_page_pool_simple: time_bench_page_pool02_ptr_ring(): Cannot use page_pool fast-path
| time_bench: Type:no-softirq-page_pool02 Per elem: 46 cycles(tsc) 23.501 ns (step:0)
| bench_page_pool_simple: time_bench_page_pool03_slow(): Cannot use page_pool fast-path
| time_bench: Type:no-softirq-page_pool03 Per elem: 121 cycles(tsc) 60.697 ns (step:0)
| bench_page_pool_simple: pp_tasklet_handler(): in_serving_softirq fast-path
| bench_page_pool_simple: time_bench_page_pool01_fast_path(): in_serving_softirq fast-path
| time_bench: Type:tasklet_page_pool01_fast_path Per elem: 19 cycles(tsc) 9.531 ns (step:0)
| bench_page_pool_simple: time_bench_page_pool02_ptr_ring(): in_serving_softirq fast-path
| time_bench: Type:tasklet_page_pool02_ptr_ring Per elem: 45 cycles(tsc) 22.594 ns (step:0)
| bench_page_pool_simple: time_bench_page_pool03_slow(): in_serving_softirq fast-path
| time_bench: Type:tasklet_page_pool03_slow Per elem: 123 cycles(tsc) 61.969 ns (step:0)

after:

| time_bench: Type:for_loop Per elem: 1 cycles(tsc) 0.501 ns (step:0)
| time_bench: Type:atomic_inc Per elem: 6 cycles(tsc) 3.324 ns (step:0)
| time_bench: Type:lock Per elem: 28 cycles(tsc) 14.038 ns (step:0)
| time_bench: Type:u64_stats_inc Per elem: 1 cycles(tsc) 0.565 ns (step:0)
| time_bench: Type:this_cpu_inc Per elem: 1 cycles(tsc) 0.506 ns (step:0)
|
| bench_page_pool_simple: time_bench_page_pool01_fast_path(): Cannot use page_pool fast-path
| time_bench: Type:no-softirq-page_pool01 Per elem: 18 cycles(tsc) 9.028 ns (step:0)
| bench_page_pool_simple: time_bench_page_pool02_ptr_ring(): Cannot use page_pool fast-path
| time_bench: Type:no-softirq-page_pool02 Per elem: 45 cycles(tsc) 22.714 ns (step:0)
| bench_page_pool_simple: time_bench_page_pool03_slow(): Cannot use page_pool fast-path
| time_bench: Type:no-softirq-page_pool03 Per elem: 120 cycles(tsc) 60.428 ns (step:0)
| bench_page_pool_simple: pp_tasklet_handler(): in_serving_softirq fast-path
| bench_page_pool_simple: time_bench_page_pool01_fast_path(): in_serving_softirq fast-path
| time_bench: Type:tasklet_page_pool01_fast_path Per elem: 18 cycles(tsc) 9.024 ns (step:0)
| bench_page_pool_simple: time_bench_page_pool02_ptr_ring(): in_serving_softirq fast-path
| time_bench: Type:tasklet_page_pool02_ptr_ring Per elem: 43 cycles(tsc) 22.028 ns (step:0)
| bench_page_pool_simple: time_bench_page_pool03_slow(): in_serving_softirq fast-path
| time_bench: Type:tasklet_page_pool03_slow Per elem: 121 cycles(tsc) 60.736 ns (step:0)

v1…v2: https://lore.kernel.org/all/20250221115221.291006-1-bigeasy@xxxxxxxxxxxxx
  - Clarified the cover mail, added output from pahole and from
    bench_page_pool_simple.ko.
  - Corrected page_pool_alloc_stats vs page_pool_recycle_stats type in the
    last patch.
  - Copy the counter values outside of the do {} while loop and add them
    later.
  - Redid the mlx5 patch to make it use generic infrastructure which is
    now extended as part of this series.

Sebastian Andrzej Siewior (5):
  page_pool: Provide an empty page_pool_stats for disabled stats.
  page_pool: Add per-queue statistics.
  mlx5: Use generic code for page_pool statistics.
  page_pool: Convert page_pool_recycle_stats to u64_stats_t.
  page_pool: Convert page_pool_alloc_stats to u64_stats_t.
 Documentation/networking/page_pool.rst        |   6 +-
 .../ethernet/mellanox/mlx5/core/en_stats.c    |  87 +++----------
 .../ethernet/mellanox/mlx5/core/en_stats.h    |  30 +----
 include/linux/u64_stats_sync.h                |   5 +
 include/net/page_pool/helpers.h               |  11 ++
 include/net/page_pool/types.h                 |  31 +++--
 net/core/page_pool.c                          | 122 ++++++++++++++----
 net/core/page_pool_user.c                     |  22 ++--
 8 files changed, 163 insertions(+), 151 deletions(-)

-- 
2.47.2