On 13/04/2023 13.16, Toke Høiland-Jørgensen wrote:
> Qiongwen Xu <qx51@xxxxxxxxxxxxxx> writes:
>
>> Hi Jesper,
>>
>> Thanks for the detailed reply and sharing these helpful
>> materials/papers with us!
>
> (Please don't top post on the mailing list).
+1
>> After enabling rx_cqe_compress, the throughput in our experiment
>> increases from 70+ Mpps to 85 Mpps. We also tried to use the counter
>> "rx_discards_phy". The counter increases in both cpu-limited and
>> pcie-limited experiments, i.e., an experiment that is only
>> cpu-limited also increases the counter. We are looking for a counter
>> that can separate the cpu- and pcie-limited cases. Regarding the
>> [pcie-bench] tool, unfortunately we are not able to use it, as it
>> requires FPGA hardware.
>
> Well, are your CPUs being maxed out? IIRC it was pretty obvious that
> they weren't when we were running those tests, so just looking at
> something like 'mpstat' should give you a hint.
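A minimal sketch of how one might poll a NIC hardware counter such as rx_discards_phy between test runs. The interface name and the sample counter values below are invented for illustration; on real hardware you would replace the embedded sample with the output of `ethtool -S <ifname>`:

```shell
# Invented sample of `ethtool -S` style output (values are made up);
# swap in: sample=$(ethtool -S eth0) on a live machine.
sample='     rx_discards_phy: 12345
     rx_packets_phy: 99999999'

# Extract one named counter from "name: value" lines.
get_ctr() {
  printf '%s\n' "$sample" | awk -v k="$1" '$1 == k":" { print $2; exit }'
}

get_ctr rx_discards_phy   # prints 12345
```

Reading the counter twice with a sleep in between gives a drop rate, which is usually more telling than the absolute value.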
As you can see in [1], I find this mpstat command very useful:

 $ mpstat -P ALL -u -I SCPU -I SUM 2

The tool turbostat will also tell you how busy individual CPUs are.
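To spot a maxed-out core quickly, one can compute busy% as 100 minus the idle column per CPU. A small sketch, assuming a simplified mpstat-like layout (the sample lines and the 95% threshold are invented; real `mpstat -P ALL` output has more columns, so the idle field index would need adjusting):

```shell
# Invented, simplified per-CPU sample; on a live box pipe in
# `mpstat -P ALL 1 1` instead and pick the %idle column.
sample='CPU  %usr %sys %idle
0    10.0  5.0  85.0
3    60.0 39.5   0.5'

# Flag CPUs whose busy% (100 - %idle) exceeds 95%.
busy_report() {
  awk 'NR > 1 { busy = 100 - $4;
    printf "CPU %s busy %.1f%%%s\n", $1, busy,
           (busy > 95 ? "  <-- maxed out" : "") }'
}

printf '%s\n' "$sample" | busy_report
```

A core pinned near 100% busy points at a CPU-limited run; all cores with idle headroom while throughput stalls points elsewhere (e.g. PCIe).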
> For more detailed analysis you can use 'perf' to see exactly where the
> CPU is spending its time.
Again a practical hint. Perf record with cmdline:

 # perf record -g -a -- sleep 10

Look at results with a cmdline that also exposes the 'cpu' info:

 # perf report --sort cpu,dso,symbol --no-children

Look at a specific CPU, e.g. core 3 (counting from 0), with cmdline:

 # perf report --sort cpu,dso,symbol --no-children -C3

--Jesper

Links:
[1] https://github.com/xdp-project/xdp-paper/blob/master/benchmarks/bench02_xdp_drop.org#test-100g-bandwidth