(answered inline below) On 07/04/2023 03.46, Qiongwen Xu wrote:
> Dear XDP experts,
>
> I am a PhD student at Rutgers. Recently, I have been reading the XDP
> paper "The eXpress Data Path: Fast Programmable Packet Processing in
> the Operating System Kernel". In section 4.1 and 4.3, you mention the
> throughputs of xdp programs (packet drop and packet forwarding) are
> limited by the PCIe (e.g., "Both scale their performance linearly
> until they approach the global performance limit of the PCI bus").
Most of the article[1][2] authors are likely on this mailing list,
including me. (Sad to see we called it "PCI *bus*" and not just PCIe.)
> I am curious about how you figured out it was the PCIe limitation.
It is worth noting that the PCIe limitation shown in the article is
related to the number of PCIe transactions with small packets (Ethernet
minimum frame size, 64 bytes), thus NOT bandwidth related.

The observations that led to the PCIe-limitation conclusion: a single
CPU doing XDP_DROP (25 Mpps) was using 100% CPU time (runtime attributed
to ksoftirqd). When we scaled XDP_DROP up to run on more CPUs, we saw
something strange[3]. It scaled linearly up to 3 CPUs, but at 4 CPUs
each CPU started to process fewer packets per sec (pps) and the total
(86 Mpps) stayed the same. Even stranger, the CPUs were no longer using
100% CPU; they had "time" to idle. Looking at ethtool stats, we noticed
the counter "rx_discards_phy", which (we were told) increments when
PCIe causes backpressure. (A minimal XDP_DROP program and a sketch for
reading such counters programmatically are included at the end of this
mail.)

What confirmed the PCIe (transactions) bottleneck[4] was discovering
that enabling the mlx5 priv-flags rx_cqe_compress=on (and
rx_striding_rq=off) changed the total limit (86 Mpps to 108 Mpps), as
rx_cqe_compress reduces the number of PCIe transactions by compressing
the RX descriptors. Thus, confirming this was related to PCIe.

> Is there any tool or method to check this?

I *highly* recommend that you read this article [pci1][pci2]:
 - Title: "Understanding PCIe performance for end host networking"

I wish we had read and referenced this article in ours (but both
appeared in 2018). They give a theoretical model for PCIe, both
bandwidth and latency, that could be used to explain our PCIe
observations. They also released their [pcie-bench] tool.

I wish more (kernel) performance people understood that PCIe is a
protocol (3 layers: physical, data link layer (DLL) and Transaction
Layer Packets (TLP)) used between the device and the host OS driver.
Networking people usually ignore this PCIe protocol step, with its
associated protocol overheads, which actually causes a network packet
to be split into smaller PCIe TLP "packets" with their own PCIe-level
headers. Besides the packet data itself, the PCIe protocol is used for
reading TX descriptors (seen from the device), writing RX descriptors
(seen from the device), and reading/updating queue pointers. (A rough
per-packet transaction count is sketched below.)

It might surprise people that article [pci1] shows that PCIe (128B
payload) introduces a latency of around 600 ns (nanosec), which is
significantly larger than the inter-packet gap needed for wirespeed
networking. Thus, latency hiding happens "behind our back": the device
and DMA engine have to keep many transactions in flight to utilize the
NIC (yet another hidden queue in the system). (The last sketch below
puts numbers on this.)
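For reference, the XDP_DROP test program is essentially the smallest
possible XDP program. Below is a minimal sketch (the actual benchmark
setups live in the xdp-paper repo [2]):

/* xdp_drop.c: drop every packet at the earliest possible point,
 * before the kernel allocates an SKB.
 *
 * Build:  clang -O2 -g -target bpf -c xdp_drop.c -o xdp_drop.o
 * Attach: ip link set dev <ifname> xdpdrv obj xdp_drop.o sec xdp
 */
#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

SEC("xdp")
int xdp_drop_prog(struct xdp_md *ctx)
{
	return XDP_DROP;
}

char _license[] SEC("license") = "GPL";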
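On the tooling question: the counter above comes from
"ethtool -S <dev>", and the priv-flags are flipped with
"ethtool --set-priv-flags <dev> rx_cqe_compress on". For those who want
the counter programmatically, here is a minimal sketch using the
SIOCETHTOOL ioctl (the same kernel interface ethtool itself uses);
error handling is trimmed and the interface/counter arguments are just
example defaults:

/* read_stat.c: print one NIC ethtool counter, e.g. rx_discards_phy.
 * Sketch only: minimal error handling, allocation checks omitted.
 */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <sys/socket.h>
#include <net/if.h>
#include <linux/ethtool.h>
#include <linux/sockios.h>

int main(int argc, char **argv)
{
	const char *ifname  = argc > 1 ? argv[1] : "eth0";
	const char *counter = argc > 2 ? argv[2] : "rx_discards_phy";
	struct ifreq ifr = {0};
	int fd = socket(AF_INET, SOCK_DGRAM, 0);

	if (fd < 0) { perror("socket"); return 1; }
	strncpy(ifr.ifr_name, ifname, IFNAMSIZ - 1);

	/* 1) Ask the driver how many stats it exposes */
	struct ethtool_drvinfo drvinfo = { .cmd = ETHTOOL_GDRVINFO };
	ifr.ifr_data = (void *)&drvinfo;
	if (ioctl(fd, SIOCETHTOOL, &ifr) < 0) { perror("GDRVINFO"); return 1; }
	__u32 n = drvinfo.n_stats;

	/* 2) Fetch the stat names ... */
	struct ethtool_gstrings *names =
		calloc(1, sizeof(*names) + (size_t)n * ETH_GSTRING_LEN);
	names->cmd = ETHTOOL_GSTRINGS;
	names->string_set = ETH_SS_STATS;
	names->len = n;
	ifr.ifr_data = (void *)names;
	if (ioctl(fd, SIOCETHTOOL, &ifr) < 0) { perror("GSTRINGS"); return 1; }

	/* 3) ... and the values, then match on the requested name */
	struct ethtool_stats *vals =
		calloc(1, sizeof(*vals) + (size_t)n * sizeof(__u64));
	vals->cmd = ETHTOOL_GSTATS;
	vals->n_stats = n;
	ifr.ifr_data = (void *)vals;
	if (ioctl(fd, SIOCETHTOOL, &ifr) < 0) { perror("GSTATS"); return 1; }

	for (__u32 i = 0; i < n; i++) {
		const char *name = (char *)&names->data[i * ETH_GSTRING_LEN];
		if (strncmp(name, counter, ETH_GSTRING_LEN) == 0)
			printf("%s: %llu\n", name,
			       (unsigned long long)vals->data[i]);
	}
	free(names); free(vals); close(fd);
	return 0;
}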
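To make the "transactions, not bandwidth" point concrete, here is a
rough per-packet accounting. The constants (~24 bytes of
framing+DLL+TLP-header overhead per transaction, a 32-byte completion
descriptor) are assumed round numbers for illustration, not mlx5
specifics; [pci1] has the precise model:

/* pcie_txn.c: rough per-packet PCIe cost for RX of a 64B frame.
 * ASSUMED round numbers for illustration (see [pci1] for the model).
 */
#include <stdio.h>

int main(void)
{
	const double tlp_ovh = 24.0; /* assumed per-TLP overhead (bytes) */
	const double pkt     = 64.0; /* minimum Ethernet frame           */
	const double desc    = 32.0; /* assumed RX completion descriptor */

	/* Without CQE compression: one DMA write for the packet data
	 * plus one for the completion descriptor; doorbell/pointer
	 * updates are amortized over a batch and ignored here. */
	double tlps  = 2.0;
	double bytes = (pkt + tlp_ovh) + (desc + tlp_ovh);
	printf("64B frame: ~%.0f TLPs, ~%.0f bytes on the PCIe wire\n",
	       tlps, bytes);

	/* With rx_cqe_compress=on several completion descriptors share
	 * one TLP, so the per-packet transaction count drops below 2,
	 * which fits the observed jump from 86 to 108 Mpps. */
	return 0;
}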
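Finally, to put the ~600 ns in perspective, this back-of-envelope
sketch computes the per-packet time budget at wirespeed with
minimum-size frames (standard 20B preamble+IFG wire overhead) and how
many transactions must be in flight to hide the latency. The 600 ns
figure is from [pci1]; the rest is standard Ethernet arithmetic:

/* inflight.c: how many DMA transactions must be kept in flight to
 * hide ~600 ns of PCIe latency at wirespeed with 64B frames.
 */
#include <stdio.h>

int main(void)
{
	const double frame_bits  = (64 + 20) * 8; /* 64B + 20B preamble/IFG */
	const double pcie_lat_ns = 600.0;         /* latency per transaction */
	const double rates_gbps[] = { 10, 25, 40, 100 };

	for (int i = 0; i < 4; i++) {
		double pps      = rates_gbps[i] * 1e9 / frame_bits;
		double gap_ns   = 1e9 / pps;            /* budget per packet */
		double inflight = pcie_lat_ns / gap_ns; /* txns to hide latency */
		printf("%5.0f Gbit/s: %6.1f Mpps, %6.2f ns/pkt, ~%3.0f in flight\n",
		       rates_gbps[i], pps / 1e6, gap_ns, inflight);
	}
	return 0;
}

At 100 Gbit/s this gives roughly 6.7 ns per packet, i.e. on the order
of 90 transactions in flight just to keep up, which is why the email
calls this "yet another hidden queue in the system".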
--Jesper

Links:
[1] https://dl.acm.org/doi/10.1145/3281411.3281443
[2] https://github.com/xdp-project/xdp-paper
[3] https://github.com/xdp-project/xdp-paper/blob/master/benchmarks/bench02_xdp_drop.org#possible-pcie-limit
[4] https://github.com/xdp-project/xdp-paper/blob/master/benchmarks/bench01_baseline.org#initial-data-from-jespers-runs
[pci0] https://dl.acm.org/doi/10.1145/3230543.3230560
[pci1] https://www.cl.cam.ac.uk/research/srg/netos/projects/pcie-bench/neugebauer2018understanding.pdf
[pci2] https://www.cl.cam.ac.uk/research/srg/netos/projects/pcie-bench/
[pcie-bench] https://github.com/pcie-bench/pcie-model