On Sat, 24 Oct 2020 15:57:50 +0200
Federico Parola <fede.parola@xxxxxxxxxx> wrote:

> On 19/10/20 20:26, Jesper Dangaard Brouer wrote:
> > On Mon, 19 Oct 2020 17:23:18 +0200
> > Federico Parola <fede.parola@xxxxxxxxxx> wrote:
> >>
> >> [...]
> >>
> >> Hi Jesper, sorry for the late reply, these are the cache refs/misses
> >> for 4 flows and different rx ring sizes:
> >>
> >> RX 512 (9.4 Mpps dropped):
> >>  Performance counter stats for 'CPU(s) 0,1,2,13' (10 runs):
> >>    23771011  cache-references                              ( +- 0.04% )
> >>     8865698  cache-misses   # 37.296 % of all cache refs   ( +- 0.04% )
> >>
> >> RX 128 (39.4 Mpps dropped):
> >>  Performance counter stats for 'CPU(s) 0,1,2,13' (10 runs):
> >>    68177470  cache-references                              ( +- 0.01% )
> >>       23898  cache-misses   #  0.035 % of all cache refs   ( +- 3.23% )
> >>
> >> Reducing the size of the rx ring leads to a huge decrease in cache
> >> misses; is this the effect of DDIO turning on?
> >
> > Yes, exactly.
> >
> > It is very high that 37.296 % of all cache refs are cache misses.
> > The number of cache misses, 8,865,698, is close to your reported 9.4
> > Mpps. Thus, it seems to correlate with the idea that this is DDIO
> > missing, as you have a miss per packet.
> >
> > I can see that you have selected a subset of the CPUs (0,1,2,13); it
> > is important that these are the active CPUs. I usually select only a
> > single/individual CPU to make sure I can reason about the numbers.
> > I've seen before that some CPUs get the DDIO effect and others don't,
> > so watch out for this.
> >
> > If you add the HW counters -e instructions -e cycles to your perf
> > stat command, you will also see the instructions-per-cycle
> > calculation. You should notice that the cache misses also reduce this
> > number, as the CPU stalls and cannot keep the pipeline full/busy.
> >
> > What kind of CPU are you using? Specifically the cache sizes (use
> > dmidecode and look for "Cache Information").
>
> I'm using an Intel Xeon Gold 5120, L1: 896 KiB, L2: 14 MiB, L3: 19.25 MiB.

Is this a NUMA system?

The numbers you report are for all cores together. Looking at [1] and
[2], I can see this is a 14-core CPU. According to [3] the cache is:

 Level 1 cache size:
  14 x 32 KB 8-way set associative instruction caches
  14 x 32 KB 8-way set associative data caches
 Level 2 cache size:
  14 x 1 MB 16-way set associative caches
 Level 3 cache size:
  19.25 MB 11-way set associative non-inclusive shared cache

One thing that catches my eye is the "non-inclusive" cache, and that
[4] states "rearchitected cache hierarchy designed for server
workloads".

[1] https://en.wikichip.org/wiki/intel/xeon_gold/5120
[2] https://ark.intel.com/content/www/us/en/ark/products/120474/intel-xeon-gold-5120-processor-19-25m-cache-2-20-ghz.html
[3] https://www.cpu-world.com/CPUs/Xeon/Intel-Xeon%205120.html
[4] https://en.wikichip.org/wiki/intel/xeon_gold

> > The performance drop is a little too large: 39.4 Mpps -> 9.4 Mpps.
> >
> > If I were you, I would measure the speed of the memory, using the
> > lmbench-3.0 command 'lat_mem_rd':
> >
> >  /usr/lib/lmbench/bin/x86_64-linux-gnu/lat_mem_rd 2000 128
> >
> > The output is the nanosec latency of accessing increasing sizes of
> > memory. The jumps/increases in latency should be fairly clear and
> > show the latency of the different cache levels. For my CPU, an
> > E5-1650 v4 @ 3.60GHz with a 15MB L3 cache, I see L1=1.055ns,
> > L2=5.521ns, L3=17.569ns.
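
To make my suggestions above concrete, this is roughly how I would run
the two measurements pinned to a single core, so the numbers can be
compared directly (CPU 0 and NUMA node 0 below are just placeholders;
adjust them to the core(s) your NIC IRQs are actually running on):

 # perf stat on one CPU only, with the IPC counters added:
 perf stat -C 0 -r 10 -e cache-references -e cache-misses \
     -e instructions -e cycles sleep 10

 # lat_mem_rd pinned to one core and, on a NUMA system, to the memory
 # node local to the NIC:
 numactl --cpunodebind=0 --membind=0 \
     /usr/lib/lmbench/bin/x86_64-linux-gnu/lat_mem_rd 2000 128

On a NUMA system the node binding matters, because remote-node memory
latency is significantly higher than local-node latency and would skew
the comparison.
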
> > (I could not find a tool that tells me the cost of accessing
> > main memory, but maybe it is the 17.569ns, as the tool measurement
> > jumps from 12MB (5.933ns) to 16MB (12.334ns) and I know L3 is 15MB,
> > so I don't get an accurate L3 measurement.)
>
> I ran the benchmark; I can see two clearly distinct jumps (L1 and L2
> cache, I guess) of 1.543ns and 5.400ns, but then the latency grows
> gradually:

I guess you left out some numbers below for the 1.543ns measurement you
mention in the text. There is a plateau at 5.508ns, and another plateau
at 8.629ns, which could be L3?

> 0.25000 5.400
> 0.37500 5.508
> 0.50000 5.508
> 0.75000 6.603
> 1.00000 8.247
> 1.50000 8.616
> 2.00000 8.747
> 3.00000 8.629
> 4.00000 8.629
> 6.00000 8.676
> 8.00000 8.800
> 12.00000 9.119
> 16.00000 10.840
> 24.00000 16.650
> 32.00000 19.888
> 48.00000 21.582
> 64.00000 22.519
> 96.00000 23.473
> 128.00000 24.125
> 192.00000 24.777
> 256.00000 25.124
> 384.00000 25.445
> 512.00000 25.642
> 768.00000 25.775
> 1024.00000 25.869
> 1536.00000 25.942
>
> I can't really tell where L3 cache and main memory start.

I guess the plateau around 25.445ns is the main memory speed.

The latency difference is very large, but the performance drop is still
too large: 39.4 Mpps -> 9.4 Mpps. Back-of-envelope calc: going from
8.629ns to 25.445ns is approx a factor 3 (25.445/8.629 = 2.948).
9.4 Mpps x factor is 27.7 Mpps, and 39.4 Mpps div factor is 13.36 Mpps.
Meaning it doesn't add up to explain this difference.

> One thing I forgot to mention is that I experience the same
> performance drop even without specifying the --readmem flag of the bpf
> sample (no_touch mode); if I'm not wrong, without the flag the eBPF
> program should not access the packet buffer and therefore DDIO should
> have no effect.

I was going to ask you to swap between the --readmem flag and no_touch
mode, and then measure whether the perf stat cache-misses stay the
same. It sounds like you already did this?

DDIO/DCA is something the CPU chooses to do, based on a proprietary
design by Intel. Thus, it is hard to say why DDIO is acting like this,
e.g. still causing a cache miss even when using no_touch mode.

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer
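
To make the no_touch vs --readmem comparison concrete, the procedure I
had in mind is roughly the following (CPU 0 is again just a
placeholder; use the RX CPU the NIC IRQ is pinned to, and keep the
traffic generator settings identical for both runs):

 # 1) run the XDP sample in no_touch mode (without --readmem) under
 #    load, and record the counters for the RX CPU:
 perf stat -C 0 -r 10 -e cache-references -e cache-misses \
     -e instructions -e cycles sleep 10

 # 2) restart the sample with --readmem and repeat the exact same
 #    perf stat run, then compare the two outputs.

If the cache-miss rate is the same in both modes, the misses are not
caused by the XDP program reading packet data, which would point at
the DMA/DDIO side rather than at the program's own memory accesses.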