Re: Bad XDP performance with mlx5

On Wed, 29 May 2019 20:16:46 +0200
Tom Barbette <barbette@xxxxxx> wrote:

> On 2019-05-29 19:16, Jesper Dangaard Brouer wrote:
> > On Wed, 29 May 2019 18:03:08 +0200
> > Tom Barbette <barbette@xxxxxx> wrote:
> >   
> >> Hi all,
> >>
> >> I've got a very simple eBPF program that counts packets per queue in a
> >> per-cpu map.  
> > 
> > Like xdp_rxq_info --dev mlx5p1 --action XDP_PASS ?  
> 
> Even simpler.
> 
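
For anyone following along, an "even simpler" per-RX-queue counter could
look roughly like the sketch below (untested; the map/program names and
the 64-queue bound are placeholders, and the legacy "maps" section format
assumes a libbpf-based loader, since the iproute2 ip-link loader expects
its own map layout):

---
#include <linux/bpf.h>

#ifndef __section
# define __section(NAME)                  \
    __attribute__((section(NAME), used))
#endif

/* Helper declaration, normally provided by bpf_helpers.h */
static void *(*bpf_map_lookup_elem)(void *map, const void *key) =
    (void *) BPF_FUNC_map_lookup_elem;

/* Legacy libbpf-style map definition */
struct bpf_map_def {
    unsigned int type;
    unsigned int key_size;
    unsigned int value_size;
    unsigned int max_entries;
    unsigned int map_flags;
};

/* One __u64 slot per RX queue; PERCPU_ARRAY gives each CPU its own copy */
struct bpf_map_def __section("maps") rxq_cnt = {
    .type        = BPF_MAP_TYPE_PERCPU_ARRAY,
    .key_size    = sizeof(__u32),
    .value_size  = sizeof(__u64),
    .max_entries = 64,                  /* placeholder queue upper bound */
};

__section("prog")
int xdp_count(struct xdp_md *ctx)
{
    __u32 key = ctx->rx_queue_index;
    __u64 *cnt = bpf_map_lookup_elem(&rxq_cnt, &key);

    if (cnt)
        (*cnt)++;                       /* per-CPU map, no atomics needed */
    return XDP_PASS;
}

char __license[] __section("license") = "GPL";
---

It builds with the same clang command as the xdp_pass.c further down.
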
> >   
> >> I use iPerf in TCP mode and limit the CPU cores to 2, so performance is
> >> limited by the CPU (always at 100%).
> >>
> >> With an XL710 NIC on a 40G link, I get 32.5 Gbps with the XDP program
> >> loaded and ~33.3 Gbps without it. Pretty similar, somewhat expected.
> >>
> >> With a ConnectX-5 on a 100G link, I get ~33.3 Gbps without the XDP program
> >> but ~26 Gbps with it. The behavior seems similar with a simple XDP_PASS
> >> program.
> > 
> > Are you sure?  
> 
> 
> xdp_pass.c:
> ---
> #include <linux/bpf.h>
> 
> #ifndef __section
> # define __section(NAME)                  \
>     __attribute__((section(NAME), used))
> #endif
> 
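> /* Note: despite the name, this program passes every packet (XDP_PASS). */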
> __section("prog")
> int xdp_drop(struct xdp_md *ctx) {
>      return XDP_PASS;
> }
> 
> char __license[] __section("license") = "GPL";
> ---
> clang -O2 -target bpf -c xdp_pass.c -o xdp_pass.o
> 
> Then see results with netperf below.
> 
> > 
> > My test on a ConnectX-5 100G link shows:
> >   - 33.8 Gbits/sec - with no XDP prog
> >   - 34.5 Gbits/sec - with xdp_rxq_info
> >   
> 
> Even faster? :p
> 
> >> Any idea why the mlx5 driver behaves like this? perf top is not conclusive
> >> at first glance. I'd say check_object_size and
> >> copy_user_enhanced_fast_string rise, but it is unclear from where in the
> >> stack they are called.
> >   
> > It is possible to get very different and varying TCP bandwidth results,
> > depending on whether the TCP-server process is running on the same CPU as
> > the NAPI-RX loop.  If they share the CPU then results are worse, as
> > process-context scheduling sets the limit.
> 
> iPerf runs one instance per core, with SO_REUSEPORT and a BPF filter to
> map queues to CPUs 1:1, with irqbalance killed and set_affinity*sh.
> So the setup is similar between tests in that regard, and the variance
> does not come from different assignments.
> That is not what you're advising, but it ensures a similar per-core
> "pipeline" and reproducible tests. As a side question, any link on this
> L1/L2 cache misses vs. scheduling question is welcome.
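
For readers who want to reproduce that kind of reuseport steering, the
usual trick is a classic-BPF filter that returns the current CPU id; the
kernel uses that value as the index of the socket to pick from the
reuseport group (and falls back to hash selection if it is out of range).
A rough sketch in C, with error handling and the per-CPU thread pinning
left out; the function name and port argument are placeholders:

---
#include <arpa/inet.h>
#include <linux/filter.h>   /* struct sock_filter/sock_fprog, SKF_AD_* */
#include <netinet/in.h>
#include <sys/socket.h>

#ifndef SO_ATTACH_REUSEPORT_CBPF
#define SO_ATTACH_REUSEPORT_CBPF 51     /* from asm-generic/socket.h */
#endif

static int reuseport_cpu_listener(unsigned short port)
{
    /* Classic BPF: "return the CPU id this packet is processed on" */
    struct sock_filter code[] = {
        { BPF_LD | BPF_W | BPF_ABS, 0, 0, SKF_AD_OFF + SKF_AD_CPU },
        { BPF_RET | BPF_A,          0, 0, 0 },
    };
    struct sock_fprog prog = {
        .len    = sizeof(code) / sizeof(code[0]),
        .filter = code,
    };
    struct sockaddr_in addr = {
        .sin_family      = AF_INET,
        .sin_port        = htons(port),
        .sin_addr.s_addr = htonl(INADDR_ANY),
    };
    int one = 1;
    int fd = socket(AF_INET, SOCK_STREAM, 0);

    setsockopt(fd, SOL_SOCKET, SO_REUSEPORT, &one, sizeof(one));
    setsockopt(fd, SOL_SOCKET, SO_ATTACH_REUSEPORT_CBPF, &prog, sizeof(prog));
    bind(fd, (struct sockaddr *)&addr, sizeof(addr));
    listen(fd, 128);
    return fd;
}
---

Create the listeners in CPU order (the N-th socket added to the group
gets index N) and pin the thread serving socket N to CPU N; connections
processed on queue/CPU N then land on that socket.
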
> 
> > 
> > This is easiest to demonstrate with netperf option -Tn,n:
> > 
> > $ netperf -H 198.18.1.1 -D1 -T2,2 -t TCP_STREAM -l 120
> > MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 198.18.1.1 () port 0 AF_INET : histogram : demo : cpu bind
> > Interim result: 35344.39 10^6bits/s over 1.002 seconds ending at 1559149724.219
> > Interim result: 35294.66 10^6bits/s over 1.001 seconds ending at 1559149725.221
> > Interim result: 36112.09 10^6bits/s over 1.002 seconds ending at 1559149726.222
> > Interim result: 36301.13 10^6bits/s over 1.000 seconds ending at 1559149727.222
> > ^CInterim result: 36146.78 10^6bits/s over 0.507 seconds ending at 1559149727.730
> > Recv   Send    Send
> > Socket Socket  Message  Elapsed
> > Size   Size    Size     Time     Throughput
> > bytes  bytes   bytes    secs.    10^6bits/sec
> > 
> > 131072  16384  16384    4.51     35801.94
> >   
> 
> server$ sudo service netperf start
> server$ sudo killall -9 irqbalance
> server$ sudo ethtool -X dpdk1 equal 2

Interesting use of ethtool -X (set the RX flow hash indirection table); I
could use that myself in some of my tests.  I usually change the number
of RX-queues via ethtool -L (or --set-channels), which the i40e/XL710
have issues with...


> server$ sudo ip link set dev dpdk1 xdp off
> client$ netperf -H 10.220.0.5 -D1 -T2,2 -t TCP_STREAM -l 120
> MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 10.220.0.5 () port 0 AF_INET : demo : cpu bind
> Interim result: 37221.90 10^6bits/s over 1.015 seconds ending at 1559151699.433
> Interim result: 37811.52 10^6bits/s over 1.003 seconds ending at 1559151700.436
> Interim result: 38195.47 10^6bits/s over 1.001 seconds ending at 1559151701.437
> Interim result: 41089.18 10^6bits/s over 1.000 seconds ending at 1559151702.437
> Interim result: 38005.40 10^6bits/s over 1.081 seconds ending at 1559151703.518
> Interim result: 34419.33 10^6bits/s over 1.104 seconds ending at 1559151704.622
> ^CInterim result: 40634.33 10^6bits/s over 0.198 seconds ending at 1559151704.820
> Recv   Send    Send
> Socket Socket  Message  Elapsed
> Size   Size    Size     Time     Throughput
> bytes  bytes   bytes    secs.    10^6bits/sec
> 
> 131072  16384  16384    6.41     37758.53
> 
> server$ sudo ip link set dev dpdk1 xdp obj xdp_pass.o
> client$ netperf -H 10.220.0.5 -D1 -T2,2 -t TCP_STREAM -l 120
> MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 10.220.0.5 () port 0 AF_INET : demo : cpu bind
> Interim result: 31669.02 10^6bits/s over 1.021 seconds ending at 1559151575.906
> Interim result: 31164.97 10^6bits/s over 1.016 seconds ending at 1559151576.923
> Interim result: 31525.57 10^6bits/s over 1.001 seconds ending at 1559151577.924
> Interim result: 28835.03 10^6bits/s over 1.093 seconds ending at 1559151579.017
> Interim result: 36336.89 10^6bits/s over 1.000 seconds ending at 1559151580.017
> Interim result: 31021.22 10^6bits/s over 1.171 seconds ending at 1559151581.188
> Interim result: 37469.64 10^6bits/s over 1.000 seconds ending at 1559151582.189
> ^CInterim result: 33209.38 10^6bits/s over 0.403 seconds ending at 1559151582.591
> Recv   Send    Send
> Socket Socket  Message  Elapsed
> Size   Size    Size     Time     Throughput
> bytes  bytes   bytes    secs.    10^6bits/sec
> 
> 131072  16384  16384    7.71     32518.84
> 
> server$ sudo ip link set dev dpdk1 xdp off
> server$ sudo ip link set dev dpdk1 xdp obj xdp_count.o
> netperf -H 10.220.0.5 -D1 -T2,2 -t TCP_STREAM -l 120
> MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 10.220.0.5 () port 0 AF_INET : demo : cpu bind
> Interim result: 33090.36 10^6bits/s over 1.019 seconds ending at 1559151856.741
> Interim result: 32823.68 10^6bits/s over 1.008 seconds ending at 1559151857.749
> Interim result: 34766.21 10^6bits/s over 1.000 seconds ending at 1559151858.749
> Interim result: 36246.28 10^6bits/s over 1.034 seconds ending at 1559151859.784
> Interim result: 34757.19 10^6bits/s over 1.043 seconds ending at 1559151860.826
> Interim result: 29434.22 10^6bits/s over 1.181 seconds ending at 1559151862.007
> Interim result: 32619.29 10^6bits/s over 1.004 seconds ending at 1559151863.011
> ^CInterim result: 36102.22 10^6bits/s over 0.448 seconds ending at 1559151863.459
> Recv   Send    Send
> Socket Socket  Message  Elapsed
> Size   Size    Size     Time     Throughput
> bytes  bytes   bytes    secs.    10^6bits/sec
> 
> 131072  16384  16384    7.74     33470.75
> 
> There is higher variance than in my iperf test (50 flows), but without
> XDP it is always around 40 Gbps, while with XDP it ranges from 32 to 37,
> mostly 32. What I'm more sure of is that the XL710 does not exhibit this
> behavior, with netperf either:
> 
> server$ sudo ip link set dev enp213s0f0 xdp off
> client$ netperf -H 10.230.0.1 -D1 -T2,2 -t TCP_STREAM -l 120
> MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 10.230.0.1 () port 0 AF_INET : demo : cpu bind
> Interim result: 18358.39 10^6bits/s over 1.001 seconds ending at 1559152311.334
> Interim result: 18635.27 10^6bits/s over 1.001 seconds ending at 1559152312.334
> Interim result: 18393.82 10^6bits/s over 1.013 seconds ending at 1559152313.348
> Interim result: 18741.75 10^6bits/s over 1.000 seconds ending at 1559152314.348
> Interim result: 18700.84 10^6bits/s over 1.002 seconds ending at 1559152315.350
> ^CInterim result: 18059.26 10^6bits/s over 0.307 seconds ending at 1559152315.657
> Recv   Send    Send
> Socket Socket  Message  Elapsed
> Size   Size    Size     Time     Throughput
> bytes  bytes   bytes    secs.    10^6bits/sec
> 
> 131072  16384  16384    5.33     18523.59
> 
> server$ sudo ip link set dev enp213s0f0 xdp obj xdp_pass.o
> netperf -H 10.230.0.1 -D1 -T2,2 -t TCP_STREAM -l 120
> MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 10.230.0.1 () port 0 AF_INET : demo : cpu bind
> Interim result: 17867.08 10^6bits/s over 1.001 seconds ending at 1559152387.230
> Interim result: 18444.22 10^6bits/s over 1.000 seconds ending at 1559152388.230
> Interim result: 18226.31 10^6bits/s over 1.012 seconds ending at 1559152389.242
> Interim result: 18411.24 10^6bits/s over 1.001 seconds ending at 1559152390.243
> Interim result: 18420.69 10^6bits/s over 1.001 seconds ending at 1559152391.244
> Interim result: 18236.47 10^6bits/s over 1.010 seconds ending at 1559152392.254
> Interim result: 18026.38 10^6bits/s over 1.012 seconds ending at 1559152393.265
> ^CInterim result: 18390.50 10^6bits/s over 0.465 seconds ending at 1559152393.730
> Recv   Send    Send
> Socket Socket  Message  Elapsed
> Size   Size    Size     Time     Throughput
> bytes  bytes   bytes    secs.    10^6bits/sec
> 
> 131072  16384  16384    7.50     18236.5
> 
> For some reason, everything happens on the same core with the XL710, but
> not with mlx5, which uses 2 cores (one for interrupt/NAPI and one for
> netserver). Any idea why? Is TX affinity working with the XL710 but not
> with mlx5? Anyway, my iperf test would not set that, so the problem does
> not lie there.

What SMP affinity script are you using?

The Mellanox driver uses another "layout"/naming scheme
in /proc/irq/*/*name*/../smp_affinity_list.

For normal Intel-based NICs I use this:

echo " --- Align IRQs ---"
# I've named my NICs ixgbe1 + ixgbe2
for F in /proc/irq/*/ixgbe*-TxRx-*/../smp_affinity_list; do
   # Extract irqname e.g. "ixgbe2-TxRx-2"
   irqname=$(basename $(dirname $(dirname $F))) ;
   # Substring pattern removal
   hwq_nr=${irqname#*-*-}
   echo $hwq_nr > $F
   #grep . -H $F;
done
grep -H . /proc/irq/*/ixgbe*/../smp_affinity_list

But for Mellanox I had to use this:

echo " --- Align IRQs : mlx5 ---"
for F in /proc/irq/*/mlx5_comp*/../smp_affinity; do
        dir=$(dirname $F) ;
        cat $dir/affinity_hint > $F
done
grep -H . /proc/irq/*/mlx5_comp*/../smp_affinity_list


> > $ netperf -H 198.18.1.1 -D1 -T1,1 -t TCP_STREAM -l 120
> > MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 198.18.1.1 () port 0 AF_INET : histogram : demo : cpu bind
> > Interim result: 26990.45 10^6bits/s over 1.000 seconds ending at 1559149733.554
> > Interim result: 27730.35 10^6bits/s over 1.000 seconds ending at 1559149734.554
> > Interim result: 27725.76 10^6bits/s over 1.000 seconds ending at 1559149735.554
> > Interim result: 27513.39 10^6bits/s over 1.008 seconds ending at 1559149736.561
> > Interim result: 27421.46 10^6bits/s over 1.003 seconds ending at 1559149737.565
> > ^CInterim result: 27523.62 10^6bits/s over 0.580 seconds ending at 1559149738.145
> > Recv   Send    Send
> > Socket Socket  Message  Elapsed
> > Size   Size    Size     Time     Throughput
> > bytes  bytes   bytes    secs.    10^6bits/sec
> > 
> > 131072  16384  16384    5.59     27473.50
> >
> >   
> >> I use 5.1-rc3, compiled myself using Ubuntu 18.04's latest .config file.  
> > 
> > I use 5.1.0-bpf-next (with some patches on top of commit 35c99ffa20).
> >   
> I'm rebasing on 5.1.5; I do not wish to go too close to the bleeding
> edge on this project (unless needed).
>
> I do have one patch to copy the RSS hash into the xdp_buff, but the
> field is read even if XDP is disabled.

What is your use-case for this?

Upstream will likely request that this is added as xdp_buff->metadata
using the BTF format... but that is a longer project, see [1], and it is
currently scheduled as a "medium-term" task... let us know if you want
to work on this...

[1] https://github.com/xdp-project/xdp-project/blob/master/xdp-project.org#metadata-available-to-programs
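
In the meantime, the existing mechanism for handing per-packet values
(such as an RSS hash that a driver patch could provide) from XDP to later
consumers is the metadata area in front of the packet: grow it with
bpf_xdp_adjust_meta() and a TC program can read it back via data_meta.
A minimal sketch, assuming the driver supports the XDP metadata area;
the struct/field/program names are placeholders and the stored value is
just a dummy:

---
#include <linux/bpf.h>

#ifndef __section
# define __section(NAME)                  \
    __attribute__((section(NAME), used))
#endif

/* Helper declaration, normally provided by bpf_helpers.h */
static int (*bpf_xdp_adjust_meta)(struct xdp_md *ctx, int delta) =
    (void *) BPF_FUNC_xdp_adjust_meta;

struct meta_info {
    __u32 rx_hash;                      /* placeholder field */
};

__section("prog")
int xdp_store_meta(struct xdp_md *ctx)
{
    struct meta_info *meta;
    void *data;

    /* Grow the metadata area in front of the packet data */
    if (bpf_xdp_adjust_meta(ctx, -(int)sizeof(*meta)))
        return XDP_PASS;

    data = (void *)(long)ctx->data;
    meta = (void *)(long)ctx->data_meta;
    if ((void *)(meta + 1) > data)      /* bounds check for the verifier */
        return XDP_PASS;

    meta->rx_hash = 0;                  /* dummy value */
    return XDP_PASS;
}

char __license[] __section("license") = "GPL";
---
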
-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer


