On 15/10/20 15:22, Jesper Dangaard Brouer wrote:
> On Thu, 15 Oct 2020 14:04:51 +0200 Federico Parola <fede.parola@xxxxxxxxxx> wrote:
>> On 14/10/20 16:26, Jesper Dangaard Brouer wrote:
>>> On Wed, 14 Oct 2020 14:17:46 +0200 Federico Parola <fede.parola@xxxxxxxxxx> wrote:
>>>> On 14/10/20 11:15, Jesper Dangaard Brouer wrote:
>>>>> On Wed, 14 Oct 2020 08:56:43 +0200 Federico Parola <fede.parola@xxxxxxxxxx> wrote:
>>>>>> [...]
>>>>>
>>>>> Can you try to use this[2] tool:
>>>>>
>>>>>   ethtool_stats.pl --dev enp101s0f0
>>>>>
>>>>> and notice if there are any strange counters.
>>>>>
>>>>> [2] https://github.com/netoptimizer/network-testing/blob/master/bin/ethtool_stats.pl
>>>>
>>>> [...]
>>>> The only solution I've found so far is to reduce the size of the rx
>>>> ring, as I mentioned in my former post. However, I still see a
>>>> decrease in performance when exceeding 4 cores.
>>>
>>> When you reduce the size of the rx ring, two things happen: (1) the
>>> i40e driver's page reuse/recycle trick gets less efficient, but since
>>> you are dropping packets early you are not affected by that; (2) the
>>> total amount of L3 memory you need to touch is also decreased.
>>>
>>> I think you are hitting case (2). Intel CPUs have a cool feature
>>> called DDIO (Data Direct I/O), also known as DCA (Direct Cache
>>> Access), which can deliver packet data into L3 cache memory (if the
>>> NIC is PCIe-connected directly to the CPU). The CPU is in charge when
>>> this feature is enabled, and it will try to avoid L3 thrashing and
>>> disable it in certain cases. When you reduce the size of the rx rings
>>> you also need less L3 cache memory, so the CPU will allow the DDIO
>>> feature.
>>>
>>> You can use the 'perf stat' tool to check if this is happening, by
>>> monitoring L3 (and L2) cache usage.
>>
>> What events should I monitor? LLC-load-misses/LLC-loads?
>
> Looking at my own results from the xdp-paper[1], it looks like it shows
> up as real 'cache-misses' (perf stat -e cache-misses). E.g. I ran:
>
>   sudo ~/perf stat -C3 -e cycles -e instructions -e cache-references -e cache-misses -r 3 sleep 1
>
> Notice how the 'insn per cycle' gets less efficient when we experience
> these cache-misses. Also notice how the RX size of the queues affects
> XDP-redirect in [2].
>
> [1] https://github.com/xdp-project/xdp-paper/blob/master/benchmarks/bench01_baseline.org
> [2] https://github.com/xdp-project/xdp-paper/blob/master/benchmarks/bench05_xdp_redirect.org

Hi Jesper, sorry for the late reply.
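I collected the numbers below with 'perf stat' along the lines of your
example. The exact invocation here is a reconstruction from the output:
the CPU list and the 10 runs match it, but restricting the events to
cache-references/cache-misses is my assumption:

  # reconstructed; -C list and -r 10 taken from the output, event list assumed
  sudo perf stat -C 0,1,2,13 -e cache-references -e cache-misses -r 10 sleep 1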
These are the cache refs/misses for 4 flows and different rx ring sizes:

RX 512 (9.4 Mpps dropped):

 Performance counter stats for 'CPU(s) 0,1,2,13' (10 runs):

       23771011      cache-references                                ( +- 0.04% )
        8865698      cache-misses       # 37.296 % of all cache refs ( +- 0.04% )

RX 128 (39.4 Mpps dropped):

 Performance counter stats for 'CPU(s) 0,1,2,13' (10 runs):

       68177470      cache-references                                ( +- 0.01% )
          23898      cache-misses       #  0.035 % of all cache refs ( +- 3.23% )

Reducing the size of the rx ring leads to a huge decrease in cache
misses. Is this the effect of DDIO turning on?
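As a back-of-envelope check of your working-set argument (assuming
i40e's default 2048 B rx buffers and one rx queue per core, i.e. 4
queues; both numbers are my assumptions):

  # rx buffer memory per ring configuration (buffer size assumed)
  512 descriptors * 4 queues * 2048 B = 4 MiB
  128 descriptors * 4 queues * 2048 B = 1 MiB

If I understand DDIO correctly, it is by default limited to a couple of
ways of the LLC (only a few MiB even on a large Xeon), so the 512 rings
would exceed the cache space DDIO is allowed to fill, while the 128
rings fit, which would match the numbers above.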
Federico