Re: Question about xdp: how to figure out the throughput is limited by pcie

Hi Andi and Acme,

Regarding the below discussion and subject (top-posting, as you don't need
to read the discussion to answer my perf questions).

Can we somehow use perf to profile what is happening on PCIe?
E.g. are there any PMU "uncore" counter events for PCIe?

  Hint, we can list more PMU counters via Andi's ocperf tool[42].
  # sudo ./ocperf list
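  A sketch of what this might look like (the uncore IIO PMU and the event
  name below are assumptions on my part; availability depends on CPU
  generation and kernel version, so check the list output first):
  # sudo perf list 'uncore*'
  # sudo ./ocperf stat -a -I 1000 -e unc_iio_data_req_of_cpu.mem_write.part0 sleep 10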

Could we use the TopDown [toplev] model to indicate/detect whether the
PCIe device (or PCIe root complex) is the bottleneck?

  Hint, try out the [toplev] tool looking at a specific core under load
  # sudo ./toplev.py -I 3000 -l3 -a --show-sample --core C2

--Jesper

 [toplev] https://github.com/andikleen/pmu-tools/wiki/toplev-manual
 [42] https://github.com/andikleen/pmu-tools

On 13/04/2023 04.54, Qiongwen Xu wrote:
Hi Jesper,

Thanks for the detailed reply and sharing these helpful materials/papers with us!

After enabling rx_cqe_compress, the throughput in our experiment increases from
70+ Mpps to 85 Mpps. We also tried the counter "rx_discards_phy", but it
increases in both the CPU-limited and the PCIe-limited experiments, i.e., an
experiment that is only CPU-limited can also increase the counter. We are
looking for a counter that can separate the CPU- and PCIe-limited cases.
Regarding the [pcie-bench] tool, unfortunately we are not able to use it, as
it requires FPGA hardware.

Thanks,
Qiongwen

From: Jesper Dangaard Brouer <jbrouer@xxxxxxxxxx>
Date: Sunday, April 9, 2023 at 11:46 AM
Subject: Re: Question about xdp: how to figure out the throughput is limited by pcie
(answered inline below)

On 07/04/2023 03.46, Qiongwen Xu wrote:
Dear XDP experts,

I am a PhD student at Rutgers. Recently, I have been reading the XDP
paper "The eXpress Data Path: Fast Programmable Packet Processing
in the Operating System Kernel". In sections 4.1 and 4.3, you mention that
the throughputs of XDP programs (packet drop and packet forwarding)
are limited by PCIe (e.g., "Both scale their performance linearly
until they approach the global performance limit of the PCI bus").

Most of the article's[1][2] authors are likely on this mailing list,
including me. (Sad to see we called it the "PCI *bus*" and not just PCIe.)

I am curious about how you figured out it was the PCIe limitation.

It is worth noting that the PCIe limitation shown in the article is related
to the number of PCIe transactions with small packets (Ethernet minimum
frame size, 64 bytes). (Thus it is NOT bandwidth related.)
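A quick back-of-envelope illustrates this (assuming a PCIe Gen3 x16 slot,
which is an assumption about the test machine): the 86Mpps total limit
described below amounts to 86M * 64B * 8 bits ≈ 44 Gbit/s of packet data,
well below the roughly 126 Gbit/s of usable Gen3 x16 bandwidth. So raw
bandwidth is not saturated; the per-packet transaction overhead is.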

The observations that led to the PCIe limitation conclusion:
A single CPU doing XDP_DROP (25Mpps) was using 100% CPU time (runtime
attributed to ksoftirqd).  When we scaled up XDP_DROP to run on more
CPUs we saw something strange[3].  It scaled linearly to 3 CPUs, but at 4
CPUs each CPU started to process fewer packets per sec (pps) while the total
(86Mpps) stayed the same.  Even more strange, the CPUs were no longer at
100% utilization; the CPUs had "time" to idle.  Looking at ethtool stats, we
noticed the counter "rx_discards_phy", which (we were told) increments when
PCIe causes backpressure.
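For reference, such a counter can be watched per interval with ethtool (the
interface name "mlx5p1" is just a placeholder; use your own device):
  # watch -d -n 1 'ethtool -S mlx5p1 | grep rx_discards_phy'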

What confirmed the PCIe (transactions) bottleneck[4] was discovering that
enabling the mlx5 priv-flags rx_cqe_compress=on (and
rx_striding_rq=off) changed the total limit (86Mpps to 108Mpps),
as rx_cqe_compress reduces the number of PCIe transactions by compressing
the RX descriptors.  This confirmed that the limit was related to PCIe.
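On an mlx5 NIC these flags can be inspected and toggled via ethtool
priv-flags (the interface name "mlx5p1" is a placeholder):
  # ethtool --show-priv-flags mlx5p1
  # ethtool --set-priv-flags mlx5p1 rx_cqe_compress on
  # ethtool --set-priv-flags mlx5p1 rx_striding_rq off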


  > Is there any tool or method to check this?

I *highly* recommend that you read this article [pci1][pci2]:
   - Title: "Understanding PCIe performance for end host networking"

I wish we had read and referenced this article in ours (but both were
published in 2018).  They give a theoretical model for PCIe, covering both
bandwidth and latency, which could be used to explain our PCIe
observations.  They also released their [pcie-bench] tool.

I wish more (kernel) performance people understood that PCIe is a
protocol (3 layers: physical, data link layer (DLL) and the Transaction
Layer with its Packets (TLPs)) that runs between the device and the host
OS-driver.  Networking people usually ignore this PCIe protocol step, with
its associated protocol overheads, which actually causes a network packet to
be split into smaller PCIe TLP "packets" with their own PCIe-level
headers.  Besides the packet data itself, the PCIe protocol is used for
reading TX descriptors (seen from the device), writing RX descriptors (seen
from the device), and reading/updating queue pointers.
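As a rough sketch (the exact numbers here are my assumptions; see [pci1]
for the measured model): receiving one small packet can cost the device a
read TLP to fetch the RX descriptor, a write TLP for the packet data, and a
write TLP for the completion/descriptor write-back, plus periodic doorbell
updates from the host, with each TLP carrying on the order of 20-30 bytes
of header/framing overhead.  At ~86Mpps that is several hundred million
TLPs per second, which is why cutting descriptor TLPs (e.g. via CQE
compression) moves the limit.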

It might surprise people that article [pci1] shows that PCIe (128B
payload) introduces a latency of around 600ns (nanoseconds), which is
significantly larger than the inter-packet gap needed for wirespeed
networking.  Thus, latency hiding happens "behind our back": the device
and DMA engine have to keep many transactions in-flight to
utilize the NIC (yet another hidden queue in the system).
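As a worked example (assuming a 100Gbit/s link, which is an assumption on
my part): minimum-sized 64B frames plus 20B of preamble+IFG give ~148.8Mpps
at wirespeed, i.e. one packet every ~6.7ns, so hiding 600ns of PCIe latency
requires on the order of 600/6.7 ≈ 90 transactions in-flight.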

--Jesper

Links:

   [1] https://dl.acm.org/doi/10.1145/3281411.3281443
   [2] https://github.com/xdp-project/xdp-paper
   [3]
https://github.com/xdp-project/xdp-paper/blob/master/benchmarks/bench02_xdp_drop.org
   [4]
https://github.com/xdp-project/xdp-paper/blob/master/benchmarks/bench01_baseline.org

Read this article:
   [pci0] https://dl.acm.org/doi/10.1145/3230543.3230560
   [pci1]
https://www.cl.cam.ac.uk/research/srg/netos/projects/pcie-bench/neugebauer2018understanding.pdf
   [pci2] https://www.cl.cam.ac.uk/research/srg/netos/projects/pcie-bench/
   [pcie-bench] https://github.com/pcie-bench/pcie-model




