On 15/10/20 15:22, Jesper Dangaard Brouer wrote:
> On Thu, 15 Oct 2020 14:04:51 +0200 Federico Parola <fede.parola@xxxxxxxxxx> wrote:
>> On 14/10/20 16:26, Jesper Dangaard Brouer wrote:
>>> On Wed, 14 Oct 2020 14:17:46 +0200 Federico Parola <fede.parola@xxxxxxxxxx> wrote:
>>>> On 14/10/20 11:15, Jesper Dangaard Brouer wrote:
>>>>> On Wed, 14 Oct 2020 08:56:43 +0200 Federico Parola <fede.parola@xxxxxxxxxx> wrote:
>>>>>> [...]
>>>>>
>>>>> Can you try to use this[2] tool:
>>>>>
>>>>>   ethtool_stats.pl --dev enp101s0f0
>>>>>
>>>>> and notice if there are any strange counters.
>>>>>
>>>>> [2] https://github.com/netoptimizer/network-testing/blob/master/bin/ethtool_stats.pl
>>>>
>>>> [...]
>>>> The only solution I've found so far is to reduce the size of the rx
>>>> ring, as I mentioned in my former post. However, I still see a
>>>> decrease in performance when exceeding 4 cores.
>>>
>>> When you reduce the size of the rx ring, two things happen: (1) the
>>> i40e driver's page reuse/recycle trick gets less efficient, but since
>>> you are dropping packets early you are not affected by that; (2) the
>>> total amount of L3 memory you need to touch is also decreased.
>>>
>>> I think you are hitting case (2). Intel CPUs have a cool feature
>>> called DDIO (Data Direct I/O), also known as DCA (Direct Cache
>>> Access), which can deliver packet data into L3 cache memory (if the
>>> NIC is PCIe-connected directly to the CPU). The CPU is in charge when
>>> this feature is enabled, and it will try to avoid L3 thrashing and
>>> disable it in certain cases. When you reduce the size of the rx rings
>>> you also need less L3 cache memory, so the CPU will allow the DDIO
>>> feature.
>>>
>>> You can use the 'perf stat' tool to check if this is happening, by
>>> monitoring L3 (and L2) cache usage.
>>
>> What events should I monitor? LLC-load-misses/LLC-loads?
>
> Looking at my own results from the xdp-paper[1], it looks like it shows
> up as real 'cache-misses' (perf stat -e cache-misses). E.g. I ran:
>
>   sudo ~/perf stat -C3 -e cycles -e instructions -e cache-references -e cache-misses -r 3 sleep 1
>
> Notice how the 'insn per cycle' gets less efficient when we experience
> these cache-misses. Also notice how the RX size of the queues affects
> XDP-redirect in [2].
>
> [1] https://github.com/xdp-project/xdp-paper/blob/master/benchmarks/bench01_baseline.org
> [2] https://github.com/xdp-project/xdp-paper/blob/master/benchmarks/bench05_xdp_redirect.org

Hi Jesper, sorry for the late reply.
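I collected the numbers below with 'perf stat' along the lines of your
example. The exact invocation here is a reconstruction from the output:
the CPU list and the 10 runs match it, but restricting the events to
cache-references/cache-misses is my assumption:

  # reconstructed; -C list and -r 10 taken from the output, event list assumed
  sudo perf stat -C 0,1,2,13 -e cache-references -e cache-misses -r 10 sleep 1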
These are the cache refs/misses for 4 flows and different rx ring sizes:

RX 512 (9.4 Mpps dropped):

 Performance counter stats for 'CPU(s) 0,1,2,13' (10 runs):

       23771011      cache-references                                ( +- 0.04% )
        8865698      cache-misses       # 37.296 % of all cache refs ( +- 0.04% )

RX 128 (39.4 Mpps dropped):

 Performance counter stats for 'CPU(s) 0,1,2,13' (10 runs):

       68177470      cache-references                                ( +- 0.01% )
          23898      cache-misses       #  0.035 % of all cache refs ( +- 3.23% )

Reducing the size of the rx ring leads to a huge decrease in cache
misses. Is this the effect of DDIO turning on?
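As a back-of-envelope check of your working-set argument (assuming
i40e's default 2048 B rx buffers and one rx queue per core, i.e. 4
queues; both numbers are my assumptions):

  # rx buffer memory per ring configuration (buffer size assumed)
  512 descriptors * 4 queues * 2048 B = 4 MiB
  128 descriptors * 4 queues * 2048 B = 1 MiB

If I understand DDIO correctly, it is by default limited to a couple of
ways of the LLC (only a few MiB even on a large Xeon), so the 512 rings
would exceed the cache space DDIO is allowed to fill, while the 128
rings fit, which would match the numbers above.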
Federico