Re: [Linuxarm] Re: [PATCH net v3] net: sched: fix packet stuck problem for lockless qdisc

Michal Kubecek <mkubecek@xxxxxxx> · Mon, 19 Apr 2021 17:25:27 +0200

On Mon, Apr 19, 2021 at 08:21:38PM +0800, Yunsheng Lin wrote:
> On 2021/4/19 10:04, Yunsheng Lin wrote:
> > On 2021/4/19 6:59, Michal Kubecek wrote:
> >> I tried this patch o top of 5.12-rc7 with real devices. I used two
> >> machines with 10Gb/s Intel ixgbe NICs, sender has 16 CPUs (2 8-core CPUs
> >> with HT disabled) and 16 Rx/Tx queues, receiver has 48 CPUs (2 12-core
> >> CPUs with HT enabled) and 48 Rx/Tx queues. With multiple TCP streams on
> >> a saturated ethernet, the CPU consumption grows quite a lot:
> >>
> >>     threads     unpatched 5.12-rc7    5.12-rc7 + v3   
> >>       1               25.6%               30.6%
> >>       8               73.1%              241.4%
> >>     128              132.2%             1012.0%
> 
> I do not really read the above number, but I understand that v3 has a cpu
> usage impact when it is patched to 5.12-rc7, so I do a test too on a arm64
> system with a hns3 100G netdev, which is in node 0, and node 0 has 32 cpus.
> 
> root@(none)$ cat /sys/class/net/eth4/device/numa_node
> 0
> root@(none)$ numactl -H
> available: 4 nodes (0-3)
> node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31
> node 0 size: 128646 MB
> node 0 free: 127876 MB
> node 1 cpus: 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63
> node 1 size: 129019 MB
> node 1 free: 128796 MB
> node 2 cpus: 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95
> node 2 size: 129019 MB
> node 2 free: 128755 MB
> node 3 cpus: 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127
> node 3 size: 127989 MB
> node 3 free: 127772 MB
> node distances:
> node   0   1   2   3
>   0:  10  16  32  33
>   1:  16  10  25  32
>   2:  32  25  10  16
>   3:  33  32  16  10
> 
> and I use below cmd to only use 16 tx/rx queue with 72 tx queue depth in
> order to trigger the tx queue stopping handling:
> ethtool -L eth4 combined 16
> ethtool -G eth4 tx 72
> 
> threads       unpatched          patched_v4    patched_v4+queue_stopped_opt
>    16      11% (si: 3.8%)     20% (si:13.5%)       11% (si:3.8%)
>    32      12% (si: 4.4%)     30% (si:22%)         13% (si:4.4%)
>    64      13% (si: 4.9%)     35% (si:26%)         13% (si:4.7%)
> 
> "11% (si: 3.8%)": 11% means the total cpu useage in node 0, and 3.8%
> means the softirq cpu useage .
> And thread number is as below iperf cmd:
> taskset -c 0-31 iperf -c 192.168.100.2 -t 1000000 -i 1 -P *thread*

The problem I see with this is that iperf's -P option only allows
running the test in multiple connections but they do not actually run in
multiple threads. Therefore this may not result in as much concurrency
as it seems.

Also, 100Gb/s ethernet is not so easy to saturate, trying 10Gb/s or
1Gb/s might put more pressure on qdisc code concurrency.

Michal

> It seems after applying the queue_stopped_opt patch, the cpu usage
> is closed to the unpatch one, at least with my testcase, maybe you
> can try your testcase with the queue_stopped_opt patch to see if
> it make any difference?

I will, I was not aware of v4 submission. I'll write a short note to it
so that it does not accidentally get applied before we know for sure
what the CPU usage impact is.

Michal