On 2024-02-20 13:57:24 [+0100], Jesper Dangaard Brouer wrote:
> > so I replaced nr_cpu_ids with 64 and bootet maxcpus=64 so that I can run
> > xdp-bench on the ixgbe.
> >
>
> Yes, ixgbe HW have limited TX queues, and XDP tries to allocate a
> hardware TX queue for every CPU in the system. So, I guess you have too
> many CPUs in your system - lol.
>
> Other drivers have a fallback to a locked XDP TX path, so this is also
> something to lookout for in the machine with i40e.

This locked XDP TX path starts at 64 CPUs, but XDP programs are rejected
above 64 * 2 CPUs (see the sketch further down).

> > so. i40 send, ixgbe receive.
> >
> > -t 2
> >
> > | Summary                  2,348,800 rx/s                 0 err/s
> > |   receive total          2,348,800 pkt/s        2,348,800 drop/s               0 error/s
> > |     cpu:0                2,348,800 pkt/s        2,348,800 drop/s               0 error/s
> > |   xdp_exception                  0 hit/s
> >
>
> This is way too low, with i40e sending.
>
> On my system with only -t 1 my i40e driver can send with approx 15Mpps:
>
>  Ethtool(i40e2   ) stat: 15028585 ( 15,028,585) <= tx-0.packets /sec
>  Ethtool(i40e2   ) stat: 15028589 ( 15,028,589) <= tx_packets /sec

-t1 in ixgbe:

Show adapter(s) (eth1) statistics (ONLY that changed!)
Ethtool(eth1    ) stat:    107857263 (    107,857,263) <= tx_bytes /sec
Ethtool(eth1    ) stat:    115047684 (    115,047,684) <= tx_bytes_nic /sec
Ethtool(eth1    ) stat:      1797621 (      1,797,621) <= tx_packets /sec
Ethtool(eth1    ) stat:      1797636 (      1,797,636) <= tx_pkts_nic /sec
Ethtool(eth1    ) stat:    107857263 (    107,857,263) <= tx_queue_0_bytes /sec
Ethtool(eth1    ) stat:      1797621 (      1,797,621) <= tx_queue_0_packets /sec

-t1 in i40e:

Ethtool(eno2np1 ) stat:           90 (             90) <= port.rx_bytes /sec
Ethtool(eno2np1 ) stat:            1 (              1) <= port.rx_size_127 /sec
Ethtool(eno2np1 ) stat:            1 (              1) <= port.rx_unicast /sec
Ethtool(eno2np1 ) stat:     79554379 (     79,554,379) <= port.tx_bytes /sec
Ethtool(eno2np1 ) stat:      1243037 (      1,243,037) <= port.tx_size_64 /sec
Ethtool(eno2np1 ) stat:      1243037 (      1,243,037) <= port.tx_unicast /sec
Ethtool(eno2np1 ) stat:           86 (             86) <= rx-32.bytes /sec
Ethtool(eno2np1 ) stat:            1 (              1) <= rx-32.packets /sec
Ethtool(eno2np1 ) stat:           86 (             86) <= rx_bytes /sec
Ethtool(eno2np1 ) stat:            1 (              1) <= rx_cache_waive /sec
Ethtool(eno2np1 ) stat:            1 (              1) <= rx_packets /sec
Ethtool(eno2np1 ) stat:            1 (              1) <= rx_unicast /sec
Ethtool(eno2np1 ) stat:     74580821 (     74,580,821) <= tx-0.bytes /sec
Ethtool(eno2np1 ) stat:      1243014 (      1,243,014) <= tx-0.packets /sec
Ethtool(eno2np1 ) stat:     74580821 (     74,580,821) <= tx_bytes /sec
Ethtool(eno2np1 ) stat:      1243014 (      1,243,014) <= tx_packets /sec
Ethtool(eno2np1 ) stat:      1243037 (      1,243,037) <= tx_unicast /sec

Mine is quite a bit slower, but it seems to match what I see on the RX
side.

> At this level, if you can verify that CPU:60 is 100% loaded, and packet
> generator is sending more than rx number, then it could work as a valid
> experiment.

i40e receiving on CPU 8:
%Cpu8  :  0.0 us,  0.0 sy,  0.0 ni, 84.8 id,  0.0 wa,  0.0 hi, 15.2 si,  0.0 st

ixgbe receiving on CPU 13:
%Cpu13 :  0.0 us,  0.0 sy,  0.0 ni, 56.7 id,  0.0 wa,  0.0 hi, 43.3 si,  0.0 st

Both look mostly idle. On the sending side kpktgend_0 is always at 100%.

> > -t 18
> > | Summary                  7,784,946 rx/s                 0 err/s
> > |   receive total          7,784,946 pkt/s        7,784,946 drop/s               0 error/s
> > |     cpu:60               7,784,946 pkt/s        7,784,946 drop/s               0 error/s
> > |   xdp_exception                  0 hit/s
> >
> > after t18 it drop down to 2,…
> > Now I got worse than before since -t8 says 7,5… and it did 8,4 in the
> > morning. Do you have maybe a .config for me in case I did not enable
> > the performance switch?
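For reference, the fallback behind that 64 / 64 * 2 boundary looks
roughly like the sketch below. This is only an illustration, not literal
driver code: the struct names, ring_xmit() and the 64-queue limit are
assumptions, but ixgbe-style drivers follow this shape (one XDP TX ring
per CPU while there are enough rings, ring sharing plus a per-ring
spinlock for up to twice as many CPUs as rings, program rejected beyond
that):

/* Illustrative sketch only; struct names, fields and ring_xmit() are
 * invented, the 64 queue limit is an assumption. */
#define MAX_XDP_QUEUES	64

static DEFINE_STATIC_KEY_FALSE(xdp_locking_key);

static int xdp_setup_tx_resources(void)
{
	if (nr_cpu_ids > MAX_XDP_QUEUES * 2)
		return -ENOMEM;				/* XDP prog is rejected */
	if (nr_cpu_ids > MAX_XDP_QUEUES)
		static_branch_inc(&xdp_locking_key);	/* enable locked TX path */
	return 0;
}

static struct my_tx_ring *xdp_ring_for_cpu(struct my_adapter *ad, int cpu)
{
	/* With more CPUs than rings, several CPUs share one ring. */
	if (static_branch_unlikely(&xdp_locking_key))
		cpu %= MAX_XDP_QUEUES;
	return ad->xdp_ring[cpu];
}

static int xdp_xmit_frame(struct my_adapter *ad, struct xdp_frame *xdpf)
{
	struct my_tx_ring *ring = xdp_ring_for_cpu(ad, smp_processor_id());
	int ret;

	/* The per-ring lock is only taken when rings are shared. */
	if (static_branch_unlikely(&xdp_locking_key))
		spin_lock(&ring->tx_lock);
	ret = ring_xmit(ring, xdpf);			/* driver specific */
	if (static_branch_unlikely(&xdp_locking_key))
		spin_unlock(&ring->tx_lock);
	return ret;
}

The static key keeps the lock out of the TX fast path on machines where
every CPU still owns its own ring.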
> >
>
> I would look for root-cause with perf record +
>  perf report --sort cpu,comm,dso,symbol --no-children

While sending with ixgbe, perf top on that box shows:

| Samples: 621K of event 'cycles', 4000 Hz, Event count (approx.): 49979376685 lost: 0/0 drop: 0/0
| Overhead  CPU  Command     Shared Object  Symbol
|  31.98%  000  kpktgend_0  [kernel]        [k] xas_find
|   6.72%  000  kpktgend_0  [kernel]        [k] pfn_to_dma_pte
|   5.63%  000  kpktgend_0  [kernel]        [k] ixgbe_xmit_frame_ring
|   4.78%  000  kpktgend_0  [kernel]        [k] dma_pte_clear_level
|   3.16%  000  kpktgend_0  [kernel]        [k] __iommu_dma_unmap
|   2.30%  000  kpktgend_0  [kernel]        [k] fq_ring_free_locked
|   1.99%  000  kpktgend_0  [kernel]        [k] __domain_mapping
|   1.82%  000  kpktgend_0  [kernel]        [k] iommu_dma_alloc_iova
|   1.80%  000  kpktgend_0  [kernel]        [k] __iommu_map
|   1.72%  000  kpktgend_0  [kernel]        [k] iommu_pgsize.isra.0
|   1.70%  000  kpktgend_0  [kernel]        [k] __iommu_dma_map
|   1.63%  000  kpktgend_0  [kernel]        [k] alloc_iova_fast
|   1.59%  000  kpktgend_0  [kernel]        [k] _raw_spin_lock_irqsave
|   1.32%  000  kpktgend_0  [kernel]        [k] iommu_map
|   1.30%  000  kpktgend_0  [kernel]        [k] iommu_dma_map_page
|   1.23%  000  kpktgend_0  [kernel]        [k] intel_iommu_iotlb_sync_map
|   1.21%  000  kpktgend_0  [kernel]        [k] xa_find_after
|   1.17%  000  kpktgend_0  [kernel]        [k] ixgbe_poll
|   1.06%  000  kpktgend_0  [kernel]        [k] __iommu_unmap
|   1.04%  000  kpktgend_0  [kernel]        [k] intel_iommu_unmap_pages
|   1.01%  000  kpktgend_0  [kernel]        [k] free_iova_fast
|   0.96%  000  kpktgend_0  [pktgen]        [k] pktgen_thread_worker

The i40e box while sending:

|Samples: 400K of event 'cycles:P', 4000 Hz, Event count (approx.): 80512443924 lost: 0/0 drop: 0/0
|Overhead  CPU  Command     Shared Object  Symbol
|  24.04%  000  kpktgend_0  [kernel]       [k] i40e_lan_xmit_frame
|  17.20%  019  swapper     [kernel]       [k] i40e_napi_poll
|   4.84%  019  swapper     [kernel]       [k] intel_idle_irq
|   4.20%  019  swapper     [kernel]       [k] napi_consume_skb
|   3.00%  000  kpktgend_0  [pktgen]       [k] pktgen_thread_worker
|   2.76%  008  swapper     [kernel]       [k] i40e_napi_poll
|   2.36%  000  kpktgend_0  [kernel]       [k] dma_map_page_attrs
|   1.93%  019  swapper     [kernel]       [k] dma_unmap_page_attrs
|   1.70%  008  swapper     [kernel]       [k] intel_idle_irq
|   1.44%  008  swapper     [kernel]       [k] __udp4_lib_rcv
|   1.44%  008  swapper     [kernel]       [k] __netif_receive_skb_core.constprop.0
|   1.40%  008  swapper     [kernel]       [k] napi_build_skb
|   1.28%  000  kpktgend_0  [kernel]       [k] kfree_skb_reason
|   1.27%  008  swapper     [kernel]       [k] ip_rcv_core
|   1.19%  008  swapper     [kernel]       [k] inet_gro_receive
|   1.01%  008  swapper     [kernel]       [k] kmem_cache_free.part.0

> --Jesper

Sebastian