Hi Neil et al.,

I am Kashyap Desai, working with Avago Technologies as a driver developer. I need some help understanding the functionality of <irqbalance>, and a recommendation to fix a certain issue associated with its configuration. I am seeing CPU soft lockups on my setup. Below are the details of the setup -

1. [root@]# numactl --hardware
   available: 2 nodes (0-1)
   node 0 cpus: 0 1 2 3 4 5 6 7 16 17 18 19 20 21 22 23
   node 0 size: 65432 MB
   node 0 free: 35910 MB
   node 1 cpus: 8 9 10 11 12 13 14 15 24 25 26 27 28 29 30 31
   node 1 size: 65536 MB
   node 1 free: 33211 MB
   node distances:
   node   0   1
     0:  10  21
     1:  21  10

2. Kernel - Oracle Linux 6 UEK. irqbalance version - irqbalance-1.0.4-6.0.1.el6.x86_64

3. Two Avago IT HBAs (Invader) connected to the local node, Node-0.

4. Default setting of the above-mentioned irqbalance - "-h subset". My understanding of this setting: the driver-provided affinity hint will be used by <irqbalance>, but only for the CPU node local to the device. For the remote node it will _not_ use the driver-provided affinity hint; in fact, it will redirect all IOs that match an MSI-X vector hinted to the remote node to _only_ one CPU of the local node. Below is a snippet of the CPU/MSI-X vector affinity (a quick way to cross-check these values from /proc is sketched right after the listing):

   msix index = 0,  irq number = 120, cpu affinity mask = 00000001 hint = 00000001
   msix index = 1,  irq number = 121, cpu affinity mask = 00000002 hint = 00000002
   msix index = 2,  irq number = 122, cpu affinity mask = 00000004 hint = 00000004
   msix index = 3,  irq number = 123, cpu affinity mask = 00000008 hint = 00000008
   msix index = 4,  irq number = 124, cpu affinity mask = 00000010 hint = 00000010
   msix index = 5,  irq number = 125, cpu affinity mask = 00000020 hint = 00000020
   msix index = 6,  irq number = 126, cpu affinity mask = 00000040 hint = 00000040
   msix index = 7,  irq number = 127, cpu affinity mask = 00000080 hint = 00000080
   msix index = 8,  irq number = 128, cpu affinity mask = 00ff00ff hint = 00000100
   msix index = 9,  irq number = 129, cpu affinity mask = 00ff00ff hint = 00000200
   msix index = 10, irq number = 130, cpu affinity mask = 00ff00ff hint = 00000400
   msix index = 11, irq number = 131, cpu affinity mask = 00ff00ff hint = 00000800
   msix index = 12, irq number = 132, cpu affinity mask = 00ff00ff hint = 00001000
   msix index = 13, irq number = 133, cpu affinity mask = 00ff00ff hint = 00002000
   msix index = 14, irq number = 134, cpu affinity mask = 00ff00ff hint = 00004000
   msix index = 15, irq number = 135, cpu affinity mask = 00ff00ff hint = 00008000
   msix index = 16, irq number = 136, cpu affinity mask = 00010000 hint = 00010000
   msix index = 17, irq number = 137, cpu affinity mask = 00020000 hint = 00020000
   msix index = 18, irq number = 138, cpu affinity mask = 00040000 hint = 00040000
   msix index = 19, irq number = 139, cpu affinity mask = 00080000 hint = 00080000
   msix index = 20, irq number = 140, cpu affinity mask = 00100000 hint = 00100000
   msix index = 21, irq number = 141, cpu affinity mask = 00200000 hint = 00200000
   msix index = 22, irq number = 142, cpu affinity mask = 00400000 hint = 00400000
   msix index = 23, irq number = 143, cpu affinity mask = 00800000 hint = 00800000
   msix index = 24, irq number = 144, cpu affinity mask = 00ff00ff hint = 01000000
   msix index = 25, irq number = 145, cpu affinity mask = 00ff00ff hint = 02000000
   msix index = 26, irq number = 146, cpu affinity mask = 00ff00ff hint = 04000000
   msix index = 27, irq number = 147, cpu affinity mask = 00ff00ff hint = 08000000
   msix index = 28, irq number = 148, cpu affinity mask = 00ff00ff hint = 10000000
   msix index = 29, irq number = 149, cpu affinity mask = 00ff00ff hint = 20000000
   msix index = 30, irq number = 150, cpu affinity mask = 00ff00ff hint = 40000000
   msix index = 31, irq number = 151, cpu affinity mask = 00ff00ff hint = 80000000

   Whenever IO is generated from Node-1 (this is not an intentionally generated IO load, but it is a possible workload, and it is the one causing the maximum negative impact on IO performance, CPU lockups and other related issues), all interrupts are routed to Node-0 (logical CPU 0 only). Under such a workload we see that CPU-0 is 100% busy doing hard IRQ handling and soft IRQ migration to the other node (as rq_affinity is set to 1). See the snippet of CPU load from the machine below, taken while IO was being submitted mostly from Node-1:

   05:30:38 AM  CPU    %usr   %nice    %sys %iowait  %steal    %irq   %soft  %guest   %idle
   05:30:39 AM  all    1.07    0.00    8.92   23.40    0.00    0.00   11.51    0.00   55.11
   05:30:39 AM    0    0.00    0.00    0.00    0.00    0.00    0.00  100.00    0.00    0.00  <- this CPU is busy doing only hard IRQs from the mpt3sas IT driver
   05:30:39 AM    1    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
   05:30:39 AM    2    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
   05:30:39 AM    3    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
   05:30:39 AM    4    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
   05:30:39 AM    5    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
   05:30:39 AM    6    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
   05:30:39 AM    7    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
   05:30:39 AM    8    0.00    0.00    2.13   85.11    0.00    0.00    1.06    0.00   11.70
   05:30:39 AM    9    1.03    0.00    3.09   87.63    0.00    0.00    2.06    0.00    6.19
   05:30:39 AM   10    2.06    0.00   13.40   74.23    0.00    0.00    8.25    0.00    2.06
   05:30:39 AM   11    3.00    0.00   26.00   45.00    0.00    0.00   20.00    0.00    6.00
   05:30:39 AM   12    3.06    0.00   26.53   42.86    0.00    0.00   20.41    0.00    7.14
   05:30:39 AM   13    4.04    0.00   32.32   24.24    0.00    0.00   26.26    0.00   13.13
   05:30:39 AM   14    4.12    0.00   31.96   26.80    0.00    0.00   28.87    0.00    8.25
   05:30:39 AM   15    2.02    0.00   26.26   46.46    0.00    0.00   25.25    0.00    0.00
   05:30:39 AM   16    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
   05:30:39 AM   17    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
   05:30:39 AM   18    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
   05:30:39 AM   19    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
   05:30:39 AM   20    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
   05:30:39 AM   21    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
   05:30:39 AM   22    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
   05:30:39 AM   23    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
   05:30:39 AM   24    2.00    0.00   14.00   22.00    0.00    0.00   15.00    0.00   47.00
   05:30:39 AM   25    0.00    0.00    4.17    4.17    0.00    0.00    4.17    0.00   87.50
   05:30:39 AM   26    0.00    0.00    5.21    8.33    0.00    0.00    5.21    0.00   81.25
   05:30:39 AM   27    1.96    0.00   17.65   58.82    0.00    0.00   21.57    0.00    0.00
   05:30:39 AM   28    1.04    0.00    9.38   81.25    0.00    0.00    8.33    0.00    0.00
   05:30:39 AM   29    5.00    0.00   34.00   18.00    0.00    0.00   35.00    0.00    8.00
   05:30:39 AM   30    3.92    0.00   32.35   19.61    0.00    0.00   35.29    0.00    8.82
   05:30:39 AM   31    0.00    0.00    0.00  100.00    0.00    0.00    0.00    0.00    0.00

   In this case we observe CPU lockups as well as IO timeouts. The CPU lockup is always on CPU-0, since none of the other CPUs are used by irqbalance to service these interrupts.

5. The above-mentioned irqbalance version with "-h ignore". This means the driver-provided affinity hint will _not_ be used by <irqbalance>, irrespective of the node locality of the PCI device. irqbalance will use its own algorithm to route the IRQs: it maps each MSI-X vector to _one_ core (in this case two CPU threads, since Hyper-Threading is enabled) from the _local_ NUMA node of the device. Any IO generated from the remote node will be re-routed to _one_ specific core of the local node, so it is not like the scenario above where only CPU-0 was loaded.
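   For reference, the per-vector values above can be cross-checked directly from /proc. A minimal sketch (it assumes the mpt3sas vectors are IRQs 120-151, as in the listing above, and that the kernel exposes the driver hints via affinity_hint):

   for irq in $(seq 120 151); do
       printf "irq %s: hint=%s smp_affinity=%s\n" "$irq" \
           "$(cat /proc/irq/$irq/affinity_hint)" \
           "$(cat /proc/irq/$irq/smp_affinity)"
   done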
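   Where the interrupts actually land per CPU in each of these scenarios can be watched from /proc/interrupts; a small sketch (the "mpt" match is an assumption about how the mpt3sas vectors are named on this setup):

   egrep 'CPU|mpt' /proc/interrupts
   # or watch the per-CPU counters change while the workload runs:
   watch -d -n 1 "egrep 'mpt' /proc/interrupts"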
6. The above-mentioned irqbalance version with "-h exact". This means the driver-provided affinity hint will be used by <irqbalance> for both the local and the remote node. Any IO generated from a node will be re-routed to _one_ specific CPU of the same node - in fact, the same logical CPU that was used for submission.

In our case, the <ignore> and <subset> policies do not work because <irqbalance> is designed to consider NUMA node locality. I read the article below, in which Neil Horman explained that the default policy of <irqbalance> will be moved to <ignore>:
http://sourceforge.net/p/e1000/bugs/394/

The <mpt3sas> driver follows the same logic as the ixgbe driver: we create multiple <msix> vectors depending on the number of logical CPUs and assign one <msix vector> to a single <logical cpu>. We really do not want those assignments to be NUMA-node biased.

What should the solution be if we really need to slow down IO submission to avoid the CPU lockup? We don't want only one CPU to be kept busy doing completions. Any suggestions?

Kashyap
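P.S. For reference, the manual fallback we would rather not have to script and maintain ourselves would look roughly like the sketch below. It assumes irqbalance is stopped (e.g. "service irqbalance stop") and that the mpt3sas vectors are IRQs 120-151, as in the listing above; it simply copies the driver-provided hint into the effective affinity, i.e. the "-h exact" behaviour done by hand:

   for irq in $(seq 120 151); do
       # apply the driver's affinity hint as the effective IRQ affinity
       cat /proc/irq/$irq/affinity_hint > /proc/irq/$irq/smp_affinity
   done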