(answered inline below) On 07/04/2023 03.46, Qiongwen Xu wrote:
> Dear XDP experts,
>
> I am a PhD student at Rutgers. Recently, I have been reading the XDP
> paper "The eXpress Data Path: Fast Programmable Packet Processing in
> the Operating System Kernel". In section 4.1 and 4.3, you mention the
> throughputs of xdp programs (packet drop and packet forwarding) are
> limited by the PCIe (e.g., "Both scale their performance linearly
> until they approach the global performance limit of the PCI bus").
Most of the article[1][2] authors are likely on this mailing list,
including me. (Sad to see we called it "PCI *bus*" and not just PCIe.)
> I am curious about how you figured out it was the PCIe limitation.
It is worth noting that the PCIe limitation shown in the article is
related to the number of PCIe transactions with small packets (Ethernet
minimum frame size, 64 bytes), thus NOT bandwidth related.

The observations that led to the PCIe-limitation conclusion: a single
CPU doing XDP_DROP (25 Mpps) was using 100% CPU time (runtime attributed
to ksoftirqd). When we scaled XDP_DROP up to run on more CPUs, we saw
something strange[3]. It scaled linearly up to 3 CPUs, but at 4 CPUs
each CPU started to process fewer packets per sec (pps) and the total
(86 Mpps) stayed the same. Even stranger, the CPUs were no longer using
100% CPU; they had "time" to idle. Looking at ethtool stats, we noticed
the counter "rx_discards_phy", which (we were told) increments when
PCIe causes backpressure. (A minimal XDP_DROP program and a sketch for
reading such counters programmatically are included at the end of this
mail.)

What confirmed the PCIe (transactions) bottleneck[4] was discovering
that enabling the mlx5 priv-flags rx_cqe_compress=on (and
rx_striding_rq=off) changed the total limit (86 Mpps to 108 Mpps), as
rx_cqe_compress reduces the number of PCIe transactions by compressing
the RX descriptors. Thus, confirming this was related to PCIe.

> Is there any tool or method to check this?

I *highly* recommend that you read this article [pci1][pci2]:
 - Title: "Understanding PCIe performance for end host networking"

I wish we had read and referenced this article in ours (but both
appeared in 2018). They give a theoretical model for PCIe, both
bandwidth and latency, that could be used to explain our PCIe
observations. They also released their [pcie-bench] tool.

I wish more (kernel) performance people understood that PCIe is a
protocol (3 layers: physical, data link layer (DLL) and Transaction
Layer Packets (TLP)) used between the device and the host OS driver.
Networking people usually ignore this PCIe protocol step, with its
associated protocol overheads, which actually causes a network packet
to be split into smaller PCIe TLP "packets" with their own PCIe-level
headers. Besides the packet data itself, the PCIe protocol is used for
reading TX descriptors (seen from the device), writing RX descriptors
(seen from the device), and reading/updating queue pointers. (A rough
per-packet transaction count is sketched below.)

It might surprise people that article [pci1] shows that PCIe (128B
payload) introduces a latency of around 600 ns (nanosec), which is
significantly larger than the inter-packet gap needed for wirespeed
networking. Thus, latency hiding happens "behind our back": the device
and DMA engine have to keep many transactions in flight to utilize the
NIC (yet another hidden queue in the system). (The last sketch below
puts numbers on this.)
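For reference, the XDP_DROP test program is essentially the smallest
possible XDP program. Below is a minimal sketch (the actual benchmark
setups live in the xdp-paper repo [2]):

/* xdp_drop.c: drop every packet at the earliest possible point,
 * before the kernel allocates an SKB.
 *
 * Build:  clang -O2 -g -target bpf -c xdp_drop.c -o xdp_drop.o
 * Attach: ip link set dev <ifname> xdpdrv obj xdp_drop.o sec xdp
 */
#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

SEC("xdp")
int xdp_drop_prog(struct xdp_md *ctx)
{
	return XDP_DROP;
}

char _license[] SEC("license") = "GPL";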
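On the tooling question: the counter above comes from
"ethtool -S <dev>", and the priv-flags are flipped with
"ethtool --set-priv-flags <dev> rx_cqe_compress on". For those who want
the counter programmatically, here is a minimal sketch using the
SIOCETHTOOL ioctl (the same kernel interface ethtool itself uses);
error handling is trimmed and the interface/counter arguments are just
example defaults:

/* read_stat.c: print one NIC ethtool counter, e.g. rx_discards_phy.
 * Sketch only: minimal error handling, allocation checks omitted.
 */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <sys/socket.h>
#include <net/if.h>
#include <linux/ethtool.h>
#include <linux/sockios.h>

int main(int argc, char **argv)
{
	const char *ifname  = argc > 1 ? argv[1] : "eth0";
	const char *counter = argc > 2 ? argv[2] : "rx_discards_phy";
	struct ifreq ifr = {0};
	int fd = socket(AF_INET, SOCK_DGRAM, 0);

	if (fd < 0) { perror("socket"); return 1; }
	strncpy(ifr.ifr_name, ifname, IFNAMSIZ - 1);

	/* 1) Ask the driver how many stats it exposes */
	struct ethtool_drvinfo drvinfo = { .cmd = ETHTOOL_GDRVINFO };
	ifr.ifr_data = (void *)&drvinfo;
	if (ioctl(fd, SIOCETHTOOL, &ifr) < 0) { perror("GDRVINFO"); return 1; }
	__u32 n = drvinfo.n_stats;

	/* 2) Fetch the stat names ... */
	struct ethtool_gstrings *names =
		calloc(1, sizeof(*names) + (size_t)n * ETH_GSTRING_LEN);
	names->cmd = ETHTOOL_GSTRINGS;
	names->string_set = ETH_SS_STATS;
	names->len = n;
	ifr.ifr_data = (void *)names;
	if (ioctl(fd, SIOCETHTOOL, &ifr) < 0) { perror("GSTRINGS"); return 1; }

	/* 3) ... and the values, then match on the requested name */
	struct ethtool_stats *vals =
		calloc(1, sizeof(*vals) + (size_t)n * sizeof(__u64));
	vals->cmd = ETHTOOL_GSTATS;
	vals->n_stats = n;
	ifr.ifr_data = (void *)vals;
	if (ioctl(fd, SIOCETHTOOL, &ifr) < 0) { perror("GSTATS"); return 1; }

	for (__u32 i = 0; i < n; i++) {
		const char *name = (char *)&names->data[i * ETH_GSTRING_LEN];
		if (strncmp(name, counter, ETH_GSTRING_LEN) == 0)
			printf("%s: %llu\n", name,
			       (unsigned long long)vals->data[i]);
	}
	free(names); free(vals); close(fd);
	return 0;
}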
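To make the "transactions, not bandwidth" point concrete, here is a
rough per-packet accounting. The constants (~24 bytes of
framing+DLL+TLP-header overhead per transaction, a 32-byte completion
descriptor) are assumed round numbers for illustration, not mlx5
specifics; [pci1] has the precise model:

/* pcie_txn.c: rough per-packet PCIe cost for RX of a 64B frame.
 * ASSUMED round numbers for illustration (see [pci1] for the model).
 */
#include <stdio.h>

int main(void)
{
	const double tlp_ovh = 24.0; /* assumed per-TLP overhead (bytes) */
	const double pkt     = 64.0; /* minimum Ethernet frame           */
	const double desc    = 32.0; /* assumed RX completion descriptor */

	/* Without CQE compression: one DMA write for the packet data
	 * plus one for the completion descriptor; doorbell/pointer
	 * updates are amortized over a batch and ignored here. */
	double tlps  = 2.0;
	double bytes = (pkt + tlp_ovh) + (desc + tlp_ovh);
	printf("64B frame: ~%.0f TLPs, ~%.0f bytes on the PCIe wire\n",
	       tlps, bytes);

	/* With rx_cqe_compress=on several completion descriptors share
	 * one TLP, so the per-packet transaction count drops below 2,
	 * which fits the observed jump from 86 to 108 Mpps. */
	return 0;
}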
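Finally, to put the ~600 ns in perspective, this back-of-envelope
sketch computes the per-packet time budget at wirespeed with
minimum-size frames (standard 20B preamble+IFG wire overhead) and how
many transactions must be in flight to hide the latency. The 600 ns
figure is from [pci1]; the rest is standard Ethernet arithmetic:

/* inflight.c: how many DMA transactions must be kept in flight to
 * hide ~600 ns of PCIe latency at wirespeed with 64B frames.
 */
#include <stdio.h>

int main(void)
{
	const double frame_bits  = (64 + 20) * 8; /* 64B + 20B preamble/IFG */
	const double pcie_lat_ns = 600.0;         /* latency per transaction */
	const double rates_gbps[] = { 10, 25, 40, 100 };

	for (int i = 0; i < 4; i++) {
		double pps      = rates_gbps[i] * 1e9 / frame_bits;
		double gap_ns   = 1e9 / pps;            /* budget per packet */
		double inflight = pcie_lat_ns / gap_ns; /* txns to hide latency */
		printf("%5.0f Gbit/s: %6.1f Mpps, %6.2f ns/pkt, ~%3.0f in flight\n",
		       rates_gbps[i], pps / 1e6, gap_ns, inflight);
	}
	return 0;
}

At 100 Gbit/s this gives roughly 6.7 ns per packet, i.e. on the order
of 90 transactions in flight just to keep up, which is why the email
calls this "yet another hidden queue in the system".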
--Jesper

Links:
[1] https://dl.acm.org/doi/10.1145/3281411.3281443
[2] https://github.com/xdp-project/xdp-paper
[3] https://github.com/xdp-project/xdp-paper/blob/master/benchmarks/bench02_xdp_drop.org#possible-pcie-limit
[4] https://github.com/xdp-project/xdp-paper/blob/master/benchmarks/bench01_baseline.org#initial-data-from-jespers-runs
[pci0] https://dl.acm.org/doi/10.1145/3230543.3230560
[pci1] https://www.cl.cam.ac.uk/research/srg/netos/projects/pcie-bench/neugebauer2018understanding.pdf
[pci2] https://www.cl.cam.ac.uk/research/srg/netos/projects/pcie-bench/
[pcie-bench] https://github.com/pcie-bench/pcie-model