> > On Sun, 7 Aug 2016 19:15:27 -0700 Alexei Starovoitov <alexei.starovoitov@xxxxxxxxx> wrote:
[...]
> > > could you please share the performance numbers for mlx5 order-0 vs order-N ?
> > > You mentioned that there was some performance improvement. We need to know
> > > how much we'll lose when we turn off order-N.

There is a really easy way (after XDP) to benchmark order-0 vs order-N
for the mlx4 driver.

I simply load an XDP program that returns XDP_PASS, because loading XDP
reallocates the RX rings to use a single frame per packet and order-0
pages (for the RX ring slots).

Result summary: (order-3 pages) 4,453,022 -> (XDP_PASS) 3,295,798 pps
 * 3295798 - 4453022 = -1157224 pps slower
 * (3295798/4453022-1)*100 = -25.98% slower
 * (1/4453022-1/3295798)*10^9 = -78.85 nanosec slower
 * Approx convert nanosec to cycles (78.85 * 4GHz) = 315 cycles slower

Where does this performance regression originate from?  Well, the only
things that changed are the page allocation strategy and the number of
DMA calls in the driver.  Thus, let's look at the performance of the
page allocator (see tool Page_bench_ and MM_slides_ page 9).

On this machine:
 * Cost of order-0: 237 cycles(tsc)  59.336 ns
 * Cost of order-3: 423 cycles(tsc) 106.029 ns

The order-3 cost is amortized, as one order-3 page can store 21 frames
of size 1536, which brings the cost per page-fragment down to
20 cycles / 5.049 ns.  Thus, I would expect to see a (59.336-5.049)
54.287 ns performance reduction, not 78.85 ns, which is 24.563 ns higher
than expected (extra DMA maps cannot explain this on an Intel platform).

There is a higher percentage of L3/LLC-load-misses, which is strange, as
I thought the simple XDP program (inc map cnt and return XDP_PASS)
should not touch the data.

A quick experiment with an xdp-prog that touches data like xdp1 and
always returns XDP_PASS shows 3209235 pps, which is only 8 ns slower
((1/3209235-1/3295798)*10^9 = 8.184 ns).  Thus, the extra 24 ns (or
16 ns) might originate from an earlier cache-miss.
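
The arithmetic behind the result summary and the page allocator estimate can be replayed with a short script.  This is only a sketch and not part of the original measurements: the helper names (ns_per_packet, summarize) are made up for illustration, the 4 KiB base page size is an assumption for this x86_64 machine, and the nominal 4 GHz clock is used even though turbostat reported 4.3 GHz.

```python
# Replays the per-packet cost arithmetic from the numbers quoted above.
BASELINE_PPS = 4_453_022    # order-3 pages (no XDP program loaded)
XDP_PASS_PPS = 3_295_798    # order-0 pages (XDP_PASS program loaded)
TOUCH_DATA_PPS = 3_209_235  # xdp1-like program that also touches data
CPU_GHZ = 4.0               # nominal clock, as used in the summary

ORDER0_NS = 59.336          # page_bench01: cost of an order-0 page
ORDER3_NS = 106.029         # page_bench01: cost of an order-3 page
FRAME_SIZE = 1536           # bytes per RX frame
PAGE_SIZE = 4096            # ASSUMPTION: 4 KiB base pages


def ns_per_packet(pps):
    """Time budget per packet, in nanoseconds, at a given packet rate."""
    return 1e9 / pps


def summarize():
    """Return the derived figures as a dict (hypothetical helper)."""
    delta_ns = ns_per_packet(XDP_PASS_PPS) - ns_per_packet(BASELINE_PPS)
    # An order-3 allocation is 2^3 contiguous pages.
    frames_per_order3 = (PAGE_SIZE << 3) // FRAME_SIZE
    frag_ns = ORDER3_NS / frames_per_order3
    expected_ns = ORDER0_NS - frag_ns
    return {
        "delta_pps": XDP_PASS_PPS - BASELINE_PPS,
        "delta_pct": (XDP_PASS_PPS / BASELINE_PPS - 1) * 100,
        "delta_ns": delta_ns,
        "delta_cycles": delta_ns * CPU_GHZ,
        "frames_per_order3": frames_per_order3,
        "frag_ns": frag_ns,
        "expected_ns": expected_ns,
        "unexplained_ns": delta_ns - expected_ns,
        "touch_data_ns": ns_per_packet(TOUCH_DATA_PPS)
        - ns_per_packet(XDP_PASS_PPS),
    }


if __name__ == "__main__":
    for name, value in summarize().items():
        print(f"{name:18s} {value:14.3f}")
```

The script only reproduces the subtraction of per-packet time budgets; it says nothing about where the unexplained nanoseconds go, which is what the perf data below is for.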

Conclusion: These measurements confirm that we need a page recycle
facility for the drivers before switching to order-0 allocations.

Links:

.. _Page_bench: https://github.com/netoptimizer/prototype-kernel/blob/master/kernel/mm/bench/page_bench01.c

.. _MM_slides: http://people.netfilter.org/hawk/presentations/MM-summit2016/generic_page_pool_mm_summit2016.odp


Benchmarking notes and perf results below:

Base setup:
 * Drop packets in iptables RAW
 * Disable Ethernet flow control
 * Disable GRO (changes driver code path)
 * Mlx4 NIC CX3-pro (mlx4_core log_num_mgm_entry_size=-2)
 * CPU: i7-4790K CPU @ 4.00GHz (turbostat report 4.3GHz)

Baseline: 4.7.0-baseline+ #102 SMP PREEMPT
 * instant rx:4558943 tx:0 pps n:162 average: rx:4453022 tx:0 pps
   (instant variation TX 0.000 ns (min:0.000 max:0.000) RX 5.217 ns)

Baseline perf stat::

 $ sudo perf stat -C3 -e L1-icache-load-misses -e cycles:k -e instructions:k -e cache-misses:k -e cache-references:k -e LLC-store-misses:k -e LLC-store -e LLC-load-misses:k -e LLC-load -r 5 sleep 1

Performance counter stats for 'CPU(s) 3' (5 runs) ::

        271,417  L1-icache-load-misses                                  ( +- 0.69% ) (33.32%)
  4,383,371,009  cycles:k                                               ( +- 0.01% ) (44.51%)
  7,587,502,193  instructions:k         #  1.50 insns per cycle         ( +- 0.01% ) (55.62%)
      5,856,640  cache-misses:k         # 48.435 % of all cache refs    ( +- 0.01% ) (66.72%)
     12,091,854  cache-references:k                                     ( +- 0.04% ) (66.72%)
        451,681  LLC-store-misses                                       ( +- 0.13% ) (66.72%)
        463,152  LLC-store                                              ( +- 0.12% ) (66.68%)
      5,408,934  LLC-load-misses        # 47.26% of all LL-cache hits   (0.01%)      (22.19%)
     11,446,060  LLC-load                                               ( +- 0.04% ) (22.19%)

Samples: 40K of event 'cycles', Event count (approx.): 43956150960 ::

  Overhead  Command      Shared Object     Symbol
 +  36.59%  ksoftirqd/3  [kernel.vmlinux]  [k] memcpy_erms
 +   6.76%  ksoftirqd/3  [mlx4_en]         [k] mlx4_en_process_rx_cq
 +   6.66%  ksoftirqd/3  [ip_tables]       [k] ipt_do_table
 +   6.03%  ksoftirqd/3  [kernel.vmlinux]  [k] __build_skb
 +   4.65%  ksoftirqd/3  [kernel.vmlinux]  [k] ip_rcv
 +   4.22%  ksoftirqd/3  [mlx4_en]         [k] mlx4_en_prepare_rx_desc
 +   3.46%  ksoftirqd/3  [mlx4_en]         [k] mlx4_en_free_frag
 +   3.37%  ksoftirqd/3  [kernel.vmlinux]  [k] __netif_receive_skb_core
 +   3.04%  ksoftirqd/3  [kernel.vmlinux]  [k] __netdev_alloc_skb
 +   2.80%  ksoftirqd/3  [kernel.vmlinux]  [k] kmem_cache_alloc
 +   2.38%  ksoftirqd/3  [kernel.vmlinux]  [k] __free_page_frag
 +   1.88%  ksoftirqd/3  [kernel.vmlinux]  [k] kmem_cache_free
 +   1.65%  ksoftirqd/3  [kernel.vmlinux]  [k] nf_iterate
 +   1.59%  ksoftirqd/3  [kernel.vmlinux]  [k] nf_hook_slow
 +   1.31%  ksoftirqd/3  [kernel.vmlinux]  [k] __rcu_read_unlock
 +   0.91%  ksoftirqd/3  [kernel.vmlinux]  [k] __alloc_page_frag
 +   0.88%  ksoftirqd/3  [kernel.vmlinux]  [k] eth_type_trans
 +   0.77%  ksoftirqd/3  [kernel.vmlinux]  [k] dev_gro_receive
 +   0.76%  ksoftirqd/3  [kernel.vmlinux]  [k] skb_release_data
 +   0.76%  ksoftirqd/3  [kernel.vmlinux]  [k] __local_bh_enable_ip
 +   0.72%  ksoftirqd/3  [kernel.vmlinux]  [k] netif_receive_skb_internal
 +   0.66%  ksoftirqd/3  [kernel.vmlinux]  [k] napi_gro_receive
 +   0.66%  ksoftirqd/3  [kernel.vmlinux]  [k] __rcu_read_lock
 +   0.65%  ksoftirqd/3  [kernel.vmlinux]  [k] skb_release_head_state
 +   0.57%  ksoftirqd/3  [kernel.vmlinux]  [k] get_page_from_freelist
 +   0.57%  ksoftirqd/3  [kernel.vmlinux]  [k] __free_pages_ok
 +   0.51%  ksoftirqd/3  [kernel.vmlinux]  [k] kfree_skb
 +   0.43%  ksoftirqd/3  [kernel.vmlinux]  [k] skb_release_all

Result-xdp-pass: loading XDP_PASS program
 * instant rx:3374269 tx:0 pps n:537 average: rx:3295798 tx:0 pps
   (instant variation TX 0.000 ns (min:0.000 max:0.000) RX 7.056 ns)

Difference: 4,453,022 -> 3,295,798 pps
 * 3295798 - 4453022 = -1157224 pps slower
 * (3295798/4453022-1)*100 = -25.98% slower
 * (1/4453022-1/3295798)*10^9 = -78.85 nanosec slower

Perf stats xdp-pass::

 Performance counter stats for 'CPU(s) 3' (5 runs):

        294,219  L1-icache-load-misses                                  ( +- 0.25% ) (33.33%)
  4,382,764,897  cycles:k                                               ( +- 0.00% ) (44.51%)
  7,223,252,624  instructions:k         #  1.65 insns per cycle         ( +- 0.00% ) (55.62%)
      7,166,907  cache-misses:k         # 58.792 % of all cache refs    ( +- 0.01% ) (66.72%)
     12,190,275  cache-references:k                                     ( +- 0.03% ) (66.72%)
        525,262  LLC-store-misses                                       ( +- 0.11% ) (66.72%)
        587,354  LLC-store                                              ( +- 0.09% ) (66.68%)
      6,647,957  LLC-load-misses        # 58.23% of all LL-cache hits   ( +- 0.02% ) (22.19%)
     11,417,001  LLC-load                                               ( +- 0.03% ) (22.19%)

There is a higher percentage of L3/LLC-load-misses, which is strange, as
I thought the simple XDP program (return XDP_PASS and inc map cnt) would
not touch the data.

Perf report xdp-pass::

 Samples: 40K of event 'cycles', Event count (approx.): 43953682891
  Overhead  Command      Shared Object     Symbol
 +  25.79%  ksoftirqd/3  [kernel.vmlinux]  [k] memcpy_erms
 +   7.29%  ksoftirqd/3  [mlx4_en]         [k] mlx4_en_process_rx_cq
 +   5.42%  ksoftirqd/3  [mlx4_en]         [k] mlx4_en_free_frag
 +   5.16%  ksoftirqd/3  [kernel.vmlinux]  [k] get_page_from_freelist
 +   4.55%  ksoftirqd/3  [ip_tables]       [k] ipt_do_table
 +   4.46%  ksoftirqd/3  [mlx4_en]         [k] mlx4_alloc_pages.isra.19
 +   3.97%  ksoftirqd/3  [kernel.vmlinux]  [k] __build_skb
 +   3.67%  ksoftirqd/3  [kernel.vmlinux]  [k] free_hot_cold_page
 +   3.46%  ksoftirqd/3  [kernel.vmlinux]  [k] ip_rcv
 +   2.71%  ksoftirqd/3  [kernel.vmlinux]  [k] __alloc_pages_nodemask
 +   2.62%  ksoftirqd/3  [kernel.vmlinux]  [k] __netif_receive_skb_core
 +   2.46%  ksoftirqd/3  [kernel.vmlinux]  [k] kmem_cache_alloc
 +   2.24%  ksoftirqd/3  [kernel.vmlinux]  [k] __netdev_alloc_skb
 +   2.15%  ksoftirqd/3  [mlx4_en]         [k] mlx4_en_prepare_rx_desc
 +   1.88%  ksoftirqd/3  [kernel.vmlinux]  [k] __free_page_frag
 +   1.55%  ksoftirqd/3  [kernel.vmlinux]  [k] kmem_cache_free
 +   1.42%  ksoftirqd/3  [kernel.vmlinux]  [k] __rcu_read_unlock
 +   1.27%  ksoftirqd/3  [kernel.vmlinux]  [k] nf_iterate
 +   1.14%  ksoftirqd/3  [kernel.vmlinux]  [k] nf_hook_slow
 +   1.05%  ksoftirqd/3  [kernel.vmlinux]  [k] alloc_pages_current
 +   0.83%  ksoftirqd/3  [kernel.vmlinux]  [k] __inc_zone_state
 +   0.73%  ksoftirqd/3  [kernel.vmlinux]  [k] __list_del_entry
 +   0.69%  ksoftirqd/3  [kernel.vmlinux]  [k] __list_add
 +   0.64%  ksoftirqd/3  [kernel.vmlinux]  [k] __local_bh_enable_ip
 +   0.64%  ksoftirqd/3  [kernel.vmlinux]  [k] __rcu_read_lock
 +   0.62%  ksoftirqd/3  [kernel.vmlinux]  [k] dev_gro_receive
 +   0.62%  ksoftirqd/3  [kernel.vmlinux]  [k] swiotlb_map_page
 +   0.61%  ksoftirqd/3  [kernel.vmlinux]  [k] skb_release_data
 +   0.60%  ksoftirqd/3  [kernel.vmlinux]  [k] __alloc_page_frag
 +   0.58%  ksoftirqd/3  [kernel.vmlinux]  [k] eth_type_trans
 +   0.57%  ksoftirqd/3  [kernel.vmlinux]  [k] policy_zonelist
 +   0.51%  ksoftirqd/3  [pps_core]        [k] 0x000000000000692d
 +   0.51%  ksoftirqd/3  [kernel.vmlinux]  [k] netif_receive_skb_internal
 +   0.50%  ksoftirqd/3  [kernel.vmlinux]  [k] napi_gro_receive
 +   0.49%  ksoftirqd/3  [kernel.vmlinux]  [k] __put_page
 +   0.49%  ksoftirqd/3  [kernel.vmlinux]  [k] skb_release_head_state
 +   0.42%  ksoftirqd/3  [kernel.vmlinux]  [k] kfree_skb
 +   0.34%  ksoftirqd/3  [pps_core]        [k] 0x0000000000006935
 +   0.33%  ksoftirqd/3  [kernel.vmlinux]  [k] skb_free_head
 +   0.32%  ksoftirqd/3  [kernel.vmlinux]  [k] __netif_receive_skb
 +   0.31%  ksoftirqd/3  [kernel.vmlinux]  [k] swiotlb_sync_single
 +   0.31%  ksoftirqd/3  [kernel.vmlinux]  [k] skb_gro_reset_offset
 +   0.29%  ksoftirqd/3  [kernel.vmlinux]  [k] swiotlb_sync_single_for_cpu
 +   0.29%  ksoftirqd/3  [kernel.vmlinux]  [k] list_del
 +   0.27%  ksoftirqd/3  [iptable_raw]     [k] iptable_raw_hook
 +   0.27%  ksoftirqd/3  [kernel.vmlinux]  [k] skb_release_all
 +   0.26%  ksoftirqd/3  [kernel.vmlinux]  [k] kfree_skbmem
 +   0.25%  ksoftirqd/3  [kernel.vmlinux]  [k] swiotlb_unmap_page
 +   0.23%  ksoftirqd/3  [kernel.vmlinux]  [k] bpf_map_lookup_elem
 +   0.22%  ksoftirqd/3  [kernel.vmlinux]  [k] percpu_array_map_lookup_elem
 +   0.20%  ksoftirqd/3  [kernel.vmlinux]  [k] __page_cache_release

In perf-diff notice the increase for:
 * get_page_from_freelist (0.57%) +4.59%
 * mlx4_en_free_frag      (3.46%) +1.96%
 * mlx4_alloc_pages       (0.26%) +4.20%
 * __alloc_pages_nodemask (0.14%) +2.57%
 * swiotlb_map_page       (0.04%) +0.57%

Perf diff::

 # Baseline    Delta  Shared Object        Symbol
 # ........  .......  ...................  ................................
 #
     36.59%  -10.80%  [kernel.vmlinux]     [k] memcpy_erms
      6.76%   +0.53%  [mlx4_en]            [k] mlx4_en_process_rx_cq
      6.66%   -2.11%  [ip_tables]          [k] ipt_do_table
      6.03%   -2.06%  [kernel.vmlinux]     [k] __build_skb
      4.65%   -1.18%  [kernel.vmlinux]     [k] ip_rcv
      4.22%   -2.06%  [mlx4_en]            [k] mlx4_en_prepare_rx_desc
      3.46%   +1.96%  [mlx4_en]            [k] mlx4_en_free_frag
      3.37%   -0.75%  [kernel.vmlinux]     [k] __netif_receive_skb_core
      3.04%   -0.80%  [kernel.vmlinux]     [k] __netdev_alloc_skb
      2.80%   -0.34%  [kernel.vmlinux]     [k] kmem_cache_alloc
      2.38%   -0.50%  [kernel.vmlinux]     [k] __free_page_frag
      1.88%   -0.34%  [kernel.vmlinux]     [k] kmem_cache_free
      1.65%   -0.38%  [kernel.vmlinux]     [k] nf_iterate
      1.59%   -0.45%  [kernel.vmlinux]     [k] nf_hook_slow
      1.31%   +0.11%  [kernel.vmlinux]     [k] __rcu_read_unlock
      0.91%   -0.31%  [kernel.vmlinux]     [k] __alloc_page_frag
      0.88%   -0.30%  [kernel.vmlinux]     [k] eth_type_trans
      0.77%   -0.15%  [kernel.vmlinux]     [k] dev_gro_receive
      0.76%   -0.15%  [kernel.vmlinux]     [k] skb_release_data
      0.76%   -0.12%  [kernel.vmlinux]     [k] __local_bh_enable_ip
      0.72%   -0.21%  [kernel.vmlinux]     [k] netif_receive_skb_internal
      0.66%   -0.16%  [kernel.vmlinux]     [k] napi_gro_receive
      0.66%   -0.02%  [kernel.vmlinux]     [k] __rcu_read_lock
      0.65%   -0.17%  [kernel.vmlinux]     [k] skb_release_head_state
      0.57%   +4.59%  [kernel.vmlinux]     [k] get_page_from_freelist
      0.57%           [kernel.vmlinux]     [k] __free_pages_ok
      0.51%   -0.09%  [kernel.vmlinux]     [k] kfree_skb
      0.43%   -0.15%  [kernel.vmlinux]     [k] skb_release_all
      0.42%   -0.11%  [kernel.vmlinux]     [k] skb_gro_reset_offset
      0.41%   -0.08%  [kernel.vmlinux]     [k] skb_free_head
      0.39%   -0.07%  [kernel.vmlinux]     [k] __netif_receive_skb
      0.36%   -0.08%  [iptable_raw]        [k] iptable_raw_hook
      0.34%   -0.08%  [kernel.vmlinux]     [k] kfree_skbmem
      0.28%   +0.01%  [kernel.vmlinux]     [k] swiotlb_sync_single_for_cpu
      0.26%   +4.20%  [mlx4_en]            [k] mlx4_alloc_pages.isra.19
      0.20%   +0.11%  [kernel.vmlinux]     [k] swiotlb_sync_single
      0.15%   -0.03%  [kernel.vmlinux]     [k] __do_softirq
      0.14%   +2.57%  [kernel.vmlinux]     [k] __alloc_pages_nodemask
      0.14%           [kernel.vmlinux]     [k] free_one_page
      0.13%   -0.13%  [kernel.vmlinux]     [k] _raw_spin_lock_irqsave
      0.13%   -0.12%  [kernel.vmlinux]     [k] _raw_spin_lock
      0.10%           [kernel.vmlinux]     [k] __mod_zone_page_state
      0.09%   +0.06%  [kernel.vmlinux]     [k] net_rx_action
      0.09%           [kernel.vmlinux]     [k] __rmqueue
      0.07%           [kernel.vmlinux]     [k] __zone_watermark_ok
      0.07%           [kernel.vmlinux]     [k] PageHuge
      0.06%   +0.77%  [kernel.vmlinux]     [k] __inc_zone_state
      0.06%   +0.98%  [kernel.vmlinux]     [k] alloc_pages_current
      0.06%   +0.51%  [kernel.vmlinux]     [k] policy_zonelist
      0.06%   +0.01%  [kernel.vmlinux]     [k] delay_tsc
      0.05%   -0.00%  [mlx4_en]            [k] mlx4_en_poll_rx_cq
      0.05%   +0.01%  [kernel.vmlinux]     [k] __memcpy
      0.04%   +0.57%  [kernel.vmlinux]     [k] swiotlb_map_page
      0.04%   +0.69%  [kernel.vmlinux]     [k] __list_del_entry
      0.04%           [kernel.vmlinux]     [k] free_compound_page
      0.04%           [kernel.vmlinux]     [k] __put_compound_page
      0.03%   +0.66%  [kernel.vmlinux]     [k] __list_add

--
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  Author of http://www.iptv-analyzer.org
  LinkedIn: http://www.linkedin.com/in/brouer