Re: i40e rx_dropped increasing at 10Mpps

On 23/12/20 14:18, Srivats P wrote:
Hi,

I'm running into i40e rx_dropped constantly increasing (but not
port.rx_dropped) with my XDP program which is returning XDP_DROP after
some basic parsing and accounting. The incoming rate is 10Mpps - a
rate that XDP should easily be able to handle.

I asked for pointers to debug this issue on the Cilium & eBPF Slack
and Jesper suggested I check the CPU idle states.

Moving that conversation to this list (which I learnt about from
Jesper via that Slack thread) since I want to post a bunch of logs,
and a public, searchable forum will help anyone else who runs into a
similar problem.

Jesper had pointed me to these two threads -

https://lists.bufferbloat.net/pipermail/bloat/2020-November/016019.html
https://lore.kernel.org/xdp-newbies/VI1PR04MB3104C89EF8DCB98F5330F36C9E040@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx/

After going through those and educating myself on CPU C-states and
reading up on XL710 performance tuning and general system tuning, here
are my experiments and observations. I'm still not able to get
rx_dropped down to 0.

Any pointers or suggestions on how to reduce rx_dropped to 0 are welcome!

(my apologies for the long email with all the logs)

Setup
=====
* Dual port XL710 (i40e) connected back to back
* Each port is configured with 2 queues (ethtool -L <> combined 2)
* Application uses Queue 0 only for Tx and Queue 1 only for Rx
(ethtool -X <> weight 0 1)
* irqbalance has been stopped and each of the 4 queues has been
manually pinned to a CPU core using /proc/irq/X/smp_affinity
* ens6f0 transmits 10Mpps (64 bytes) on Q0 which is received on ens6f1
Q1 based on the above configurations
* ens6f1 (the rx port) is configured in promisc mode
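For reference, the manual pinning step can be sketched roughly like this (everything here is illustrative: the /proc/interrupts grep pattern "ens6f1-TxRx" and the starting CPU are assumptions, and the actual writes are commented out):

```shell
#!/bin/sh
# Sketch of pinning each queue IRQ to its own core, as described above.
# The i40e interrupt naming ("ens6f1-TxRx-<n>") is an assumption and
# may differ between driver versions.

cpu_to_mask() {
    # smp_affinity takes a hex bitmask; CPU n corresponds to bit n
    printf '%x\n' $(( 1 << $1 ))
}

# cpu=2
# for irq in $(awk -F: '/ens6f1-TxRx/ {gsub(/ /, "", $1); print $1}' /proc/interrupts); do
#     echo "$(cpu_to_mask "$cpu")" | sudo tee /proc/irq/"$irq"/smp_affinity
#     cpu=$(( cpu + 1 ))
# done
```

e.g. "cpu_to_mask 3" prints "8", i.e. a mask with only CPU 3 set.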

Baseline Test
=============
$ tuned-adm active
Current active profile: balanced

$ scripts/myethstats
Wed Dec 23 12:44:47 IST 2020
ens6f1: portrx: 10030972 rx:  7010503 pps     drop:  3020434 pps

Wed Dec 23 12:44:52 IST 2020
ens6f1: portrx: 10035210 rx:  6982223 pps     drop:  3052996 pps

Note: ethtool -S stats
portrx => port.rx_size_64
rx => rx_packets
drop => rx_dropped
Although port.rx_dropped is not shown above, it is always 0 in all my tests
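(For context, a script like myethstats can be approximated by sampling ethtool -S every few seconds and printing per-second counter deltas. The helper below is a guess at the arithmetic, not the actual script; the sampling loop is commented out since it needs the interface.)

```shell
#!/bin/sh
# Hypothetical sketch of the rate computation behind a stats script
# like myethstats; the real script is not shown in this mail.

rate() {
    # rate PREV CURR SECONDS -> integer per-second delta
    echo $(( ($2 - $1) / $3 ))
}

# counter() { ethtool -S "$1" | awk -v k="$2:" '$1 == k {print $2}'; }
# prev=$(counter ens6f1 rx_packets); sleep 5
# curr=$(counter ens6f1 rx_packets)
# echo "rx: $(rate "$prev" "$curr" 5) pps"
```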

$ mpstat -P 3 5
Linux 5.4.0-52-generic (dumbledore) Wednesday 23 December 2020 _x86_64_ (8 CPU)

12:45:23  IST  CPU    %usr   %nice    %sys %iowait    %irq   %soft
%steal  %guest  %gnice   %idle
12:45:28  IST    3    0.00    0.00    0.20    0.00    0.00   51.29
0.00    0.00    0.00   48.51

12:45:28  IST  CPU    %usr   %nice    %sys %iowait    %irq   %soft
%steal  %guest  %gnice   %idle
12:45:33  IST    3    0.00    0.00    0.20    0.00    0.00   50.90
0.00    0.00    0.00   48.90

$ sudo turbostat -c3 --quiet | grep -v "-"

Core CPU Avg_MHz Busy% Bzy_MHz TSC_MHz IRQ SMI POLL C1 C1E C3 C6 POLL%
C1% C1E% C3% C6% CPU%c1 CPU%c3 CPU%c6 CoreTmp PkgTmp Pkg%pc2 Pkg%pc3
Pkg%pc6 PkgWatt RAMWatt PKG_% RAM_%
3 3 962 51.72 1860 2100 11850 0 32 54 188 9780 0 0.00 0.08 0.14 48.27
0.00 28.48 19.80 0.00 42
Core CPU Avg_MHz Busy% Bzy_MHz TSC_MHz IRQ SMI POLL C1 C1E C3 C6 POLL%
C1% C1E% C3% C6% CPU%c1 CPU%c3 CPU%c6 CoreTmp PkgTmp Pkg%pc2 Pkg%pc3
Pkg%pc6 PkgWatt RAMWatt PKG_% RAM_%
3 3 964 51.78 1862 2100 11892 0 47 56 169 9932 0 0.00 0.06 0.14 48.23
0.00 28.53 19.69 0.00 43

Test Summary
------------
With the default 'balanced' power profile, 3Mpps out of the 10Mpps
incoming on the port shows up as rx_dropped. The corresponding CPU
core is only 51% busy, with all the time spent in softirq processing -
I guess the XDP program is executed in net_rx_action context? The core
spends significant time in the C1 and C3 states as it doesn't have
enough work to do.

Test 1 - throughput-performance profile
=======================================

$ sudo tuned-adm profile throughput-performance

$ scripts/myethstats
Wed Dec 23 12:47:56 IST 2020
ens6f1: portrx: 10018188 rx:  9817459 pps     drop:   200747 pps

Wed Dec 23 12:48:01 IST 2020
ens6f1: portrx: 10025587 rx:  9812874 pps     drop:   212738 pps

$ mpstat -P 3 5
Linux 5.4.0-52-generic (dumbledore) Wednesday 23 December 2020 _x86_64_ (8 CPU)

12:48:25  IST  CPU    %usr   %nice    %sys %iowait    %irq   %soft
%steal  %guest  %gnice   %idle
12:48:30  IST    3    0.40    0.00    0.80    0.00    0.00   48.89
0.00    0.00    0.00   49.90

12:48:30  IST  CPU    %usr   %nice    %sys %iowait    %irq   %soft
%steal  %guest  %gnice   %idle
12:48:35  IST    3    0.99    0.00    0.79    0.00    0.00   49.50
0.00    0.00    0.00   48.71

$ sudo turbostat -c3 --quiet | grep -v "-"
Core CPU Avg_MHz Busy% Bzy_MHz TSC_MHz IRQ SMI POLL C1 C1E C3 C6 POLL%
C1% C1E% C3% C6% CPU%c1 CPU%c3 CPU%c6 CoreTmp PkgTmp Pkg%pc2 Pkg%pc3
Pkg%pc6 PkgWatt RAMWatt PKG_% RAM_%
3 3 1403 50.22 2793 2100 11937 0 54 51 211 10657 0 0.00 0.12 0.17
49.43 0.00 28.83 20.95 0.00 44
Core CPU Avg_MHz Busy% Bzy_MHz TSC_MHz IRQ SMI POLL C1 C1E C3 C6 POLL%
C1% C1E% C3% C6% CPU%c1 CPU%c3 CPU%c6 CoreTmp PkgTmp Pkg%pc2 Pkg%pc3
Pkg%pc6 PkgWatt RAMWatt PKG_% RAM_%
3 3 1417 50.66 2797 2100 11849 0 66 70 206 10656 0 0.00 0.14 0.15
49.02 0.00 28.54 20.80 0.00 43

$ sudo perf stat -C 3
^C
  Performance counter stats for 'CPU(s) 3':

           6,193.92 msec cpu-clock                 #    1.000 CPUs
utilized
             13,107      context-switches          #    0.002 M/sec
                  4      cpu-migrations            #    0.001 K/sec
                  0      page-faults               #    0.000 K/sec
     8,90,40,71,271      cycles                    #    1.438 GHz
    19,76,25,77,895      instructions              #    2.22  insn per
cycle
     3,64,81,07,563      branches                  #  588.982 M/sec
        6,82,90,187      branch-misses             #    1.87% of all
branches

        6.193830786 seconds time elapsed

$ sudo perf top -C 3 -E 10 --stdio

    PerfTop:       0 irqs/sec  kernel: 0.0%  exact:  0.0% lost: 0/0
drop: 0/0 [4000Hz cycles],  (all, CPU: 3)
---------------------------------------------------------------------------------------------------------------------------------------------------------------


    PerfTop:    2730 irqs/sec  kernel:100.0%  exact: 100.0% lost: 0/0
drop: 0/0 [4000Hz cycles],  (all, CPU: 3)
---------------------------------------------------------------------------------------------------------------------------------------------------------------

     45.26%  [kernel]                            [k] i40e_clean_rx_irq
     30.79%  bpf_prog_265dfd693e49d7ec_xdp_prog  [k]
bpf_prog_265dfd693e49d7ec_xdp_prog
      6.16%  [kernel]                            [k] i40e_alloc_rx_buffers
      4.02%  [kernel]                            [k] percpu_array_map_lookup_elem
      2.83%  [kernel]                            [k]
dma_direct_sync_single_for_device
      2.58%  [kernel]                            [k]
dma_direct_sync_single_for_cpu
      2.20%  [kernel]                            [k] i40e_napi_poll
      1.40%  [kernel]                            [k] net_rx_action
      1.06%  [kernel]                            [k]
i40e_clean_programming_status
      0.90%  [kernel]                            [k] __x86_indirect_thunk_rax

Test Summary
------------
Switching to the throughput-performance power profile reduces
rx_dropped from ~3Mpps to ~200Kpps. CPU usage is still at ~50%, with
similar time spent in C1/C3 as the baseline. A definite improvement,
but not close to where it should be. I've included the 'perf stat' and
'perf top' output in case it provides any pointers to the experts on
this list.


Test 2 - Interrupt Rates
========================
These are the default settings before I started experimenting with them -

$ ethtool -c ens6f1
Coalesce parameters for ens6f1:
Adaptive RX: on  TX: on
stats-block-usecs: 0
sample-interval: 0
pkt-rate-low: 0
pkt-rate-high: 0

rx-usecs: 50
rx-frames: 0
rx-usecs-irq: 0
rx-frames-irq: 256

tx-usecs: 50
tx-frames: 0
tx-usecs-irq: 0
tx-frames-irq: 256

rx-usecs-low: 0
rx-frame-low: 0
tx-usecs-low: 0
tx-frame-low: 0

rx-usecs-high: 0
rx-frame-high: 0
tx-usecs-high: 0
tx-frame-high: 0

$ sudo ethtool -C ens6f1 rx-adaptive off tx-adaptive off

$ scripts/myethstats
Wed Dec 23 12:52:01 IST 2020
ens6f1: portrx: 10018143 rx:  9856910 pps     drop:   161228 pps

Wed Dec 23 12:52:06 IST 2020
ens6f1: portrx: 10020724 rx:  9797166 pps     drop:   223558 pps

$ sudo turbostat -c3 --quiet | grep -v "-"
Core CPU Avg_MHz Busy% Bzy_MHz TSC_MHz IRQ SMI POLL C1 C1E C3 C6 POLL%
C1% C1E% C3% C6% CPU%c1 CPU%c3 CPU%c6 CoreTmp PkgTmp Pkg%pc2 Pkg%pc3
Pkg%pc6 PkgWatt RAMWatt PKG_% RAM_%
3 3 1437 51.30 2802 2100 11631 0 67 54 179 10406 0 0.00 0.10 0.14
48.40 0.00 28.08 20.62 0.00 44
Core CPU Avg_MHz Busy% Bzy_MHz TSC_MHz IRQ SMI POLL C1 C1E C3 C6 POLL%
C1% C1E% C3% C6% CPU%c1 CPU%c3 CPU%c6 CoreTmp PkgTmp Pkg%pc2 Pkg%pc3
Pkg%pc6 PkgWatt RAMWatt PKG_% RAM_%
3 3 1407 50.32 2796 2100 11930 0 75 80 185 10803 0 0.00 0.15 0.19
49.32 0.00 28.74 20.93 0.00 44

----

$ sudo ethtool -C ens6f1 rx-usecs 0 tx-usecs 0

$ scripts/myethstats
Wed Dec 23 12:54:33 IST 2020
ens6f1: portrx: 10019482 rx:  9651955 pps     drop:   367483 pps

Wed Dec 23 12:54:38 IST 2020
ens6f1: portrx: 10020248 rx:  9723667 pps     drop:   296625 pps

$ sudo turbostat -c3 --quiet | grep -v "-"
Core CPU Avg_MHz Busy% Bzy_MHz TSC_MHz IRQ SMI POLL C1 C1E C3 C6 POLL%
C1% C1E% C3% C6% CPU%c1 CPU%c3 CPU%c6 CoreTmp PkgTmp Pkg%pc2 Pkg%pc3
Pkg%pc6 PkgWatt RAMWatt PKG_% RAM_%
3 3 1417 50.64 2798 2100 11798 0 78 60 203 10582 0 0.00 0.15 0.17
49.02 0.00 28.50 20.87 0.00 45
Core CPU Avg_MHz Busy% Bzy_MHz TSC_MHz IRQ SMI POLL C1 C1E C3 C6 POLL%
C1% C1E% C3% C6% CPU%c1 CPU%c3 CPU%c6 CoreTmp PkgTmp Pkg%pc2 Pkg%pc3
Pkg%pc6 PkgWatt RAMWatt PKG_% RAM_%
3 3 1412 50.48 2797 2100 11715 0 70 73 184 10496 0 0.00 0.14 0.15
49.20 0.00 28.15 21.37 0.00 44

---

$ sudo ethtool -C ens6f1 rx-usecs 10 tx-usecs 10

$ sudo turbostat -c3 --quiet | grep -v "-"
Core CPU Avg_MHz Busy% Bzy_MHz TSC_MHz IRQ SMI POLL C1 C1E C3 C6 POLL%
C1% C1E% C3% C6% CPU%c1 CPU%c3 CPU%c6 CoreTmp PkgTmp Pkg%pc2 Pkg%pc3
Pkg%pc6 PkgWatt RAMWatt PKG_% RAM_%
3 3 1423 50.83 2799 2100 11879 0 81 69 200 10478 0 0.00 0.25 0.20
48.68 0.00 28.35 20.82 0.00 44
Core CPU Avg_MHz Busy% Bzy_MHz TSC_MHz IRQ SMI POLL C1 C1E C3 C6 POLL%
C1% C1E% C3% C6% CPU%c1 CPU%c3 CPU%c6 CoreTmp PkgTmp Pkg%pc2 Pkg%pc3
Pkg%pc6 PkgWatt RAMWatt PKG_% RAM_%
3 3 1410 50.44 2796 2100 11843 0 75 88 208 10668 0 0.00 0.17 0.16
49.21 0.00 28.60 20.96 0.00 44

$ sudo ethtool -C ens6f1 rx-usecs 300 tx-usecs 300

$ sudo turbostat -c3 --quiet | grep -v "-"
Core CPU Avg_MHz Busy% Bzy_MHz TSC_MHz IRQ SMI POLL C1 C1E C3 C6 POLL%
C1% C1E% C3% C6% CPU%c1 CPU%c3 CPU%c6 CoreTmp PkgTmp Pkg%pc2 Pkg%pc3
Pkg%pc6 PkgWatt RAMWatt PKG_% RAM_%
3 3 1397 50.09 2789 2100 11595 0 65 70 203 10488 0 0.00 0.15 0.14
49.59 0.00 28.95 20.96 0.00 44
Core CPU Avg_MHz Busy% Bzy_MHz TSC_MHz IRQ SMI POLL C1 C1E C3 C6 POLL%
C1% C1E% C3% C6% CPU%c1 CPU%c3 CPU%c6 CoreTmp PkgTmp Pkg%pc2 Pkg%pc3
Pkg%pc6 PkgWatt RAMWatt PKG_% RAM_%
3 3 1398 50.08 2791 2100 11548 0 86 59 205 10474 0 0.00 0.20 0.15
49.52 0.00 28.92 21.00 0.00 44

Test Summary
------------
Changing the interrupt coalescing params doesn't seem to have any
impact. The IRQ rate shown by turbostat remains similar, and the time
spent in C1/C3 is also similar. Am I doing something wrong here or
missing something?

Before going on to next tests, I restored the interrupt coalescing
params to the default.

Test 3 - Ring size changes
==========================
$ sudo ethtool -G ens6f1 rx 4096 tx 4096

$ scripts/myethstats
Wed Dec 23 13:15:56 IST 2020
ens6f1: portrx: 10019865 rx: 10017373 pps     drop:     2466 pps

Wed Dec 23 13:16:01 IST 2020
ens6f1: portrx: 10020013 rx: 10018517 pps     drop:     1502 pps

Wed Dec 23 13:16:06 IST 2020
ens6f1: portrx: 10021465 rx: 10021209 pps     drop:        0 pps

Wed Dec 23 13:16:11 IST 2020
ens6f1: portrx: 10020225 rx: 10018721 pps     drop:     1759 pps

$ mpstat -P 3 5
Linux 5.4.0-52-generic (dumbledore) Wednesday 23 December 2020 _x86_64_ (8 CPU)

01:16:33  IST  CPU    %usr   %nice    %sys %iowait    %irq   %soft
%steal  %guest  %gnice   %idle
01:16:38  IST    3    0.00    0.00    0.00    0.00    0.00   52.31
0.00    0.00    0.00   47.69

01:16:38  IST  CPU    %usr   %nice    %sys %iowait    %irq   %soft
%steal  %guest  %gnice   %idle
01:16:43  IST    3    0.00    0.00    0.21    0.00    0.00   50.31
0.00    0.00    0.00   49.48

$ sudo turbostat -c3 --quiet | grep -v "-"
Core CPU Avg_MHz Busy% Bzy_MHz TSC_MHz IRQ SMI POLL C1 C1E C3 C6 POLL%
C1% C1E% C3% C6% CPU%c1 CPU%c3 CPU%c6 CoreTmp PkgTmp Pkg%pc2 Pkg%pc3
Pkg%pc6 PkgWatt RAMWatt PKG_% RAM_%
3 3 1427 52.99 2693 2100 12501 0 14 45 226 10782 0 0.00 0.02 0.22
46.70 0.00 29.45 17.56 0.00 45
Core CPU Avg_MHz Busy% Bzy_MHz TSC_MHz IRQ SMI POLL C1 C1E C3 C6 POLL%
C1% C1E% C3% C6% CPU%c1 CPU%c3 CPU%c6 CoreTmp PkgTmp Pkg%pc2 Pkg%pc3
Pkg%pc6 PkgWatt RAMWatt PKG_% RAM_%
3 3 1424 52.86 2693 2100 12134 0 15 53 199 10445 0 0.00 0.05 0.24
46.76 0.00 29.14 18.00 0.00 45

$ sudo perf stat -C 3
^C
  Performance counter stats for 'CPU(s) 3':

           6,126.12 msec cpu-clock                 #    1.000 CPUs
utilized
             13,779      context-switches          #    0.002 M/sec
                  0      cpu-migrations            #    0.000 K/sec
                156      page-faults               #    0.025 K/sec
     8,85,23,19,647      cycles                    #    1.445 GHz
    19,90,91,37,819      instructions              #    2.25  insn per
cycle
     3,67,47,65,127      branches                  #  599.852 M/sec
        6,89,37,131      branch-misses             #    1.88% of all
branches

        6.125995233 seconds time elapsed


$ sudo perf top -C 3 -E 10 --stdio

    PerfTop:       0 irqs/sec  kernel: 0.0%  exact:  0.0% lost: 0/0
drop: 0/0 [4000Hz cycles],  (all, CPU: 3)
---------------------------------------------------------------------------------------------------------------------------------------------------------------


    PerfTop:    2731 irqs/sec  kernel:100.0%  exact: 100.0% lost: 0/0
drop: 0/0 [4000Hz cycles],  (all, CPU: 3)
---------------------------------------------------------------------------------------------------------------------------------------------------------------

     43.68%  [kernel]                            [k] i40e_clean_rx_irq
     31.53%  bpf_prog_265dfd693e49d7ec_xdp_prog  [k]
bpf_prog_265dfd693e49d7ec_xdp_prog
      5.58%  [kernel]                            [k] i40e_alloc_rx_buffers
      3.48%  [kernel]                            [k] percpu_array_map_lookup_elem
      3.29%  [kernel]                            [k]
dma_direct_sync_single_for_cpu
      2.83%  [kernel]                            [k]
dma_direct_sync_single_for_device
      2.00%  [kernel]                            [k] i40e_napi_poll
      1.35%  [kernel]                            [k] net_rx_action
      1.25%  [kernel]                            [k]
i40e_clean_programming_status
      1.14%  [kernel]                            [k] __x86_indirect_thunk_rax

$ sudo ethtool -G ens6f1 rx 256 tx 256

$ scripts/myethstats
Wed Dec 23 13:21:31 IST 2020
ens6f1: portrx: 10018030 rx:  9575847 pps     drop:   442182 pps

Wed Dec 23 13:21:36 IST 2020
ens6f1: portrx: 10025894 rx:  9579994 pps     drop:   445900 pps

$ tuned-adm active
Current active profile: throughput-performance

Test Summary
------------
Increasing the Rx ring size from 512 to 4096, along with changing the
profile to throughput-performance, has the most impact in reducing
rx_dropped. But it is still not 0, which is where I need it to be.
Reducing the rx ring size to 256 (to check DDIO impact) increases
rx_dropped, so that's a no-go.


Driver/Firmware versions
========================
$ ethtool -i ens6f1
driver: i40e
version: 2.8.20-k
firmware-version: 6.01 0x800035d8 1.1747.0
expansion-rom-version:
bus-info: 0000:4b:00.1
supports-statistics: yes
supports-test: yes
supports-eeprom-access: yes
supports-register-dump: yes
supports-priv-flags: yes

Distro/Kernel Version
=====================
Ubuntu 18.04 LTS running Linux 5.4.0-52-generic

Srivats


Hi Srivats,
this is also a follow-up to the thread you mentioned at the beginning of the message. I found this very interesting paper [1] talking about Intel DDIO. To summarize, it should be possible to tune the number of cache ways that DDIO can use by writing to the "IIO LLC WAYS" register (address 0xc8b) of the processor (section 4.4 of the paper).

Unfortunately there is almost no documentation on this register; however, I tried writing to it and it seems to work on my processor (Xeon Gold 5120). The register contains a bitmask of the cache ways that can be used by DDIO. By default two ways should be enabled (but this of course depends on your processor), corresponding to mask 0x600; on my machine I can enable up to all 11 ways with mask 0x7ff.

If you want to give it a try, download the Intel MSR Tools [2], build them, and then use the rdmsr and wrmsr commands to read/write the register (you need to issue "sudo modprobe msr" first).
For example:

sudo wrmsr 0xc8b 0x7ff

to enable all ways.
On my machine, enabling all the ways and setting the rx ring size to 4096, I'm able to achieve zero dropped packets (at 10 Mpps) using the kernel's xdp_rxq_info sample.
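To sanity-check a mask before writing it, counting the enabled ways is just a popcount. This is pure shell arithmetic with no MSR access; the 0xc8b address, the 0x600 default, and the 11-way figure come from the paper and from my machine, so verify them on your own CPU:

```shell
#!/bin/sh
# Count how many DDIO cache ways a bitmask enables.
ways() {
    mask=$(( $1 )); n=0
    while [ "$mask" -ne 0 ]; do
        n=$(( n + (mask & 1) ))
        mask=$(( mask >> 1 ))
    done
    echo "$n"
}

ways 0x600   # default mask per the paper: prints 2
ways 0x7ff   # all ways on a Xeon Gold 5120: prints 11
```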
What processor are you using?

Best regards,
Federico


[1] https://www.usenix.org/system/files/atc20-farshin.pdf
[2] https://github.com/intel/msr-tools


