Hi Andi and Acme,

Regarding the discussion below and the subject (top-posting, as you don't need to read the discussion to answer my perf questions).

Can we somehow use perf to profile things happening in PCIe? E.g. are there any PMU counter "uncore" events for PCIe?
Hint: we can list more PMU counters via Andi's ocperf tool[42].

 # sudo ./ocperf list

Could we use the TopDown [toplev] model to indicate/detect that the PCIe device (or the PCIe root complex) is the bottleneck?
Hint: try out the [toplev] tool, looking at a specific core under load:

 # sudo ./toplev.py -I 3000 -l3 -a --show-sample --core C2

--Jesper

[toplev] https://github.com/andikleen/pmu-tools/wiki/toplev-manual
[42] https://github.com/andikleen/pmu-tools
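P.S. To make the question concrete, the kind of workflow I imagine is roughly the following (a sketch only; which uncore PMUs show up, and their event names, depend on the CPU model, and <some-uncore-event> is just a placeholder):

 # sudo perf list 2>/dev/null | grep -i uncore | less
 # sudo perf list 2>/dev/null | grep -i -e iio -e pcie

and then counting a candidate event system-wide while the XDP test runs:

 # sudo perf stat -a -I 1000 -e <some-uncore-event> sleep 10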
On 13/04/2023 04.54, Qiongwen Xu wrote:

Hi Jesper,

Thanks for the detailed reply and for sharing these helpful materials/papers with us!

After enabling rx_cqe_compress, the throughput in our experiment increases from 70+ Mpps to 85 Mpps.

We also tried to use the counter "rx_discards_phy". The counter increases in both the CPU-limited and the PCIe-limited experiments, i.e., an experiment that is only CPU-limited also increases the counter. We are looking for a counter that can separate the CPU- and PCIe-limited cases.

Regarding the [pcie-bench] tool, unfortunately we are not able to use it, as it requires FPGA hardware.

Thanks,
Qiongwen

From: Jesper Dangaard Brouer <jbrouer@xxxxxxxxxx>
Date: Sunday, April 9, 2023 at 11:46 AM
Subject: Re: Question about xdp: how to figure out the throughput is limited by pcie

(answered inline below)

On 07/04/2023 03.46, Qiongwen Xu wrote:
> Dear XDP experts,
>
> I am a PhD student at Rutgers. Recently, I have been reading the XDP paper "The eXpress Data Path: Fast Programmable Packet Processing in the Operating System Kernel". In section 4.1 and 4.3, you mention the throughputs of xdp programs (packet drop and packet forwarding) are limited by the PCIe (e.g., "Both scale their performance linearly until they approach the global performance limit of the PCI bus").

Most of the article[1][2] authors are likely on this mailing list, including me. (Sad to see we called it "PCI *bus*" and not just PCIe.)

> I am curious about how you figured out it was the PCIe limitation.

It is worth noting that the PCIe limitation shown in the article is related to the number of PCIe transactions with small packets (Ethernet minimum frame size, 64 bytes). (Thus NOT bandwidth related.)

The observations that led to the PCIe-limitation conclusion: A single CPU doing XDP_DROP (25 Mpps) was using 100% CPU time (runtime attributed to ksoftirqd). When we scaled up XDP_DROP to run on more CPUs we saw something strange[3]: it scaled linearly up to 3 CPUs, but at 4 CPUs each CPU started to process fewer packets per sec (pps) and the total (86 Mpps) stayed the same. Even more strange, the CPUs were no longer using 100% CPU; the CPUs had "time" to idle.

Looking at ethtool stats, we noticed the counter "rx_discards_phy", which (we were told) increases when PCIe causes backpressure.

What confirmed the PCIe (transactions) bottleneck was[4] when we discovered that enabling the mlx5 priv-flags rx_cqe_compress=on (and rx_striding_rq=off) changed the total limit (86 Mpps to 108 Mpps), as rx_cqe_compress reduces the number of transactions on PCIe by compressing the RX descriptors. Thus, confirming this was related to PCIe.

> Is there any tool or method to check this?

I *highly* recommend that you read this article [pci1][pci2]:
 - Title: "Understanding PCIe performance for end host networking"

I wish we had read and referenced this article in ours (but both happened in 2018). They give a theoretical model for PCIe, both bandwidth and latency, which could be used to explain our PCIe observations. They also released their [pcie-bench] tool.

I wish more (kernel) performance people understood that PCIe is a protocol (3 layers: physical, data link layer (DLL) and the transaction layer with its Transaction Layer Packets (TLPs)) that is used between the device and the host OS driver. People in networking usually ignore this PCIe protocol step and its associated overheads; the protocol actually causes a network packet to be split into smaller PCIe TLP "packets" with their own PCIe-level headers.
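A quick way to see what the NIC actually negotiated on the PCIe side is lspci, which prints the link speed/width and the TLP payload limits (a sketch only; the 03:00.0 address is just a placeholder for your NIC's PCI address, and the exact fields printed vary by device):

 # lspci | grep -i -e ethernet -e mellanox
 # sudo lspci -vvv -s 03:00.0 | grep -e LnkSta -e MaxPayload -e MaxReadReq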
Besides the packet data itself, the PCIe protocol is used for reading TX descriptors and writing RX descriptors (both seen from the device), and for reading/updating queue pointers.

It might surprise people that article [pci1] shows that PCIe (with a 128B payload) introduces a latency of around 600 ns (nanosec), which is significantly larger than the inter-packet gap needed for wirespeed networking. Thus, latency hiding happens "behind our back": the device and its DMA engine have to keep many transactions in-flight to utilize the NIC (yet another hidden queue in the system). A quick back-of-envelope after the links below illustrates how many.

--Jesper

Links:
[1] https://dl.acm.org/doi/10.1145/3281411.3281443
[2] https://github.com/xdp-project/xdp-paper
[3] https://github.com/xdp-project/xdp-paper/blob/master/benchmarks/bench02_xdp_drop.org
[4] https://github.com/xdp-project/xdp-paper/blob/master/benchmarks/bench01_baseline.org

Read this article:
[pci0] https://dl.acm.org/doi/10.1145/3230543.3230560
[pci1] https://www.cl.cam.ac.uk/research/srg/netos/projects/pcie-bench/neugebauer2018understanding.pdf
[pci2] https://www.cl.cam.ac.uk/research/srg/netos/projects/pcie-bench/
[pcie-bench] https://github.com/pcie-bench/pcie-model
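Back-of-envelope for the latency-hiding point above (my rough numbers, assuming a 100Gbit/s link and minimum-sized frames):

 64B frame + 20B preamble/IFG   = 84B = 672 bits on the wire
 100Gbit/s / 672 bits          ~= 148.8 Mpps theoretical line rate
 time budget per packet        ~= 1 / 148.8 Mpps ~= 6.7 ns
 transactions to hide 600 ns   ~= 600 / 6.7 ~= 90

i.e. the NIC's DMA engine needs on the order of 90 PCIe requests outstanding just to keep the wire busy with small packets.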