On Thu, 15 Oct 2020 14:04:51 +0200
Federico Parola <fede.parola@xxxxxxxxxx> wrote:

> On 14/10/20 16:26, Jesper Dangaard Brouer wrote:
> > On Wed, 14 Oct 2020 14:17:46 +0200
> > Federico Parola <fede.parola@xxxxxxxxxx> wrote:
> >
> >> On 14/10/20 11:15, Jesper Dangaard Brouer wrote:
> >>> On Wed, 14 Oct 2020 08:56:43 +0200
> >>> Federico Parola <fede.parola@xxxxxxxxxx> wrote:
> >>>
> >>> [...]
> >>>>> Can you try to use this[2] tool:
> >>>>>   ethtool_stats.pl --dev enp101s0f0
> >>>>>
> >>>>> And notice if there are any strange counters.
> >>>>>
> >>>>> [2] https://github.com/netoptimizer/network-testing/blob/master/bin/ethtool_stats.pl
[...]
> >> The only solution I've found so far is to reduce the size of the rx
> >> ring, as I mentioned in my former post. However, I still see a
> >> decrease in performance when exceeding 4 cores.
> >
> > Two things happen when you reduce the size of the rx ring: (1) the
> > i40e driver has a page reuse/recycle trick that gets less efficient,
> > but because you are dropping packets early you are not affected;
> > (2) the total amount of L3 cache memory you need to touch is also
> > decreased.
> >
> > I think you are hitting case (2). Intel CPUs have a cool feature
> > called DDIO (Data Direct I/O) or DCA (Direct Cache Access), which can
> > deliver packet data into L3 cache memory (if the NIC is PCIe-connected
> > directly to the CPU). The CPU is in charge when this feature is
> > enabled, and it will try to avoid L3 thrashing by disabling the
> > feature in certain cases. When you reduce the size of the rx rings
> > you also need less L3 cache memory, so the CPU will allow this DDIO
> > feature.
> >
> > You can use the 'perf stat' tool to check if this is happening, by
> > monitoring L3 (and L2) cache usage.
>
> What events should I monitor? LLC-load-misses/LLC-loads?

Looking at my own results from the xdp-paper[1], it looks like it shows
up as real 'cache-misses' (perf stat -e cache-misses). E.g. I ran:

 sudo ~/perf stat -C3 -e cycles -e instructions -e cache-references \
      -e cache-misses -r 3 sleep 1

Notice how the 'insn per cycle' gets less efficient when we experience
these cache-misses. Also notice how the RX-size of the queues affects
XDP-redirect in [2]. A sketch that adds the LLC-specific events follows
below.

[1] https://github.com/xdp-project/xdp-paper/blob/master/benchmarks/bench01_baseline.org
[2] https://github.com/xdp-project/xdp-paper/blob/master/benchmarks/bench05_xdp_redirect.org

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer
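
To answer the LLC-load-misses/LLC-loads question directly: a sketch of
a perf invocation that covers both the LLC-specific events and the
generic cache events could look like this. The -C3 pinning is only an
assumption carried over from the run above; point it at whichever CPU
the NIC IRQs land on:

 # Count cache behaviour on CPU 3 for 1 second, repeated 3 times,
 # while the XDP workload is running on that CPU.
 sudo perf stat -C3 \
      -e cycles -e instructions \
      -e LLC-loads -e LLC-load-misses \
      -e cache-references -e cache-misses \
      -r 3 sleep 1

If the CPU disables DDIO, the LLC-load-misses/LLC-loads ratio should
rise together with the drop in 'insn per cycle'. To test theory (2),
the rx ring can be shrunk with e.g. 'ethtool -G enp101s0f0 rx 512'
(512 is just an example value, not a recommendation).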