Hi Neil et al.,

I am Kashyap Desai, working with Avago Technologies as a driver developer. I need some help understanding the functionality of <irqbalance>, and a recommendation to fix a certain issue associated with its configuration. I am seeing CPU soft lockups on my setup. Below are the details of the setup -

1. [root@]# numactl --hardware
   available: 2 nodes (0-1)
   node 0 cpus: 0 1 2 3 4 5 6 7 16 17 18 19 20 21 22 23
   node 0 size: 65432 MB
   node 0 free: 35910 MB
   node 1 cpus: 8 9 10 11 12 13 14 15 24 25 26 27 28 29 30 31
   node 1 size: 65536 MB
   node 1 free: 33211 MB
   node distances:
   node   0   1
     0:  10  21
     1:  21  10

2. Kernel - Oracle Linux 6 UEK. irqbalance version - irqbalance-1.0.4-6.0.1.el6.x86_64

3. Two Avago IT HBAs (Invader) connected to the local node, Node-0.

4. Default setting of the above-mentioned irqbalance - "-h subset". My understanding of this setting: the driver-provided affinity hint will be used by <irqbalance>, but only for the CPU node local to the device. For the remote node it will _not_ use the driver-provided affinity hint; in fact, it will redirect all IOs that match an MSI-X vector hinted to the remote node to _only_ one CPU of the local node. Below is a snippet of the CPU/MSI-X vector affinity (a quick way to cross-check these values from /proc is sketched right after the listing):

   msix index = 0,  irq number = 120, cpu affinity mask = 00000001 hint = 00000001
   msix index = 1,  irq number = 121, cpu affinity mask = 00000002 hint = 00000002
   msix index = 2,  irq number = 122, cpu affinity mask = 00000004 hint = 00000004
   msix index = 3,  irq number = 123, cpu affinity mask = 00000008 hint = 00000008
   msix index = 4,  irq number = 124, cpu affinity mask = 00000010 hint = 00000010
   msix index = 5,  irq number = 125, cpu affinity mask = 00000020 hint = 00000020
   msix index = 6,  irq number = 126, cpu affinity mask = 00000040 hint = 00000040
   msix index = 7,  irq number = 127, cpu affinity mask = 00000080 hint = 00000080
   msix index = 8,  irq number = 128, cpu affinity mask = 00ff00ff hint = 00000100
   msix index = 9,  irq number = 129, cpu affinity mask = 00ff00ff hint = 00000200
   msix index = 10, irq number = 130, cpu affinity mask = 00ff00ff hint = 00000400
   msix index = 11, irq number = 131, cpu affinity mask = 00ff00ff hint = 00000800
   msix index = 12, irq number = 132, cpu affinity mask = 00ff00ff hint = 00001000
   msix index = 13, irq number = 133, cpu affinity mask = 00ff00ff hint = 00002000
   msix index = 14, irq number = 134, cpu affinity mask = 00ff00ff hint = 00004000
   msix index = 15, irq number = 135, cpu affinity mask = 00ff00ff hint = 00008000
   msix index = 16, irq number = 136, cpu affinity mask = 00010000 hint = 00010000
   msix index = 17, irq number = 137, cpu affinity mask = 00020000 hint = 00020000
   msix index = 18, irq number = 138, cpu affinity mask = 00040000 hint = 00040000
   msix index = 19, irq number = 139, cpu affinity mask = 00080000 hint = 00080000
   msix index = 20, irq number = 140, cpu affinity mask = 00100000 hint = 00100000
   msix index = 21, irq number = 141, cpu affinity mask = 00200000 hint = 00200000
   msix index = 22, irq number = 142, cpu affinity mask = 00400000 hint = 00400000
   msix index = 23, irq number = 143, cpu affinity mask = 00800000 hint = 00800000
   msix index = 24, irq number = 144, cpu affinity mask = 00ff00ff hint = 01000000
   msix index = 25, irq number = 145, cpu affinity mask = 00ff00ff hint = 02000000
   msix index = 26, irq number = 146, cpu affinity mask = 00ff00ff hint = 04000000
   msix index = 27, irq number = 147, cpu affinity mask = 00ff00ff hint = 08000000
   msix index = 28, irq number = 148, cpu affinity mask = 00ff00ff hint = 10000000
   msix index = 29, irq number = 149, cpu affinity mask = 00ff00ff hint = 20000000
   msix index = 30, irq number = 150, cpu affinity mask = 00ff00ff hint = 40000000
   msix index = 31, irq number = 151, cpu affinity mask = 00ff00ff hint = 80000000

   Whenever IO is generated from Node-1 (this is not an intentionally generated IO load, but it is a possible workload, and it is the one causing the maximum negative impact on IO performance, CPU lockups and other related issues), all interrupts are routed to Node-0 (logical CPU 0 only). Under such a workload we see that CPU-0 is 100% busy doing hard IRQ handling and soft IRQ migration to the other node (as rq_affinity is set to 1). See the snippet of CPU load from the machine below, taken while IO was being submitted mostly from Node-1:

   05:30:38 AM  CPU    %usr   %nice    %sys %iowait  %steal    %irq   %soft  %guest   %idle
   05:30:39 AM  all    1.07    0.00    8.92   23.40    0.00    0.00   11.51    0.00   55.11
   05:30:39 AM    0    0.00    0.00    0.00    0.00    0.00    0.00  100.00    0.00    0.00  <- this CPU is busy doing only hard IRQs from the mpt3sas IT driver
   05:30:39 AM    1    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
   05:30:39 AM    2    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
   05:30:39 AM    3    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
   05:30:39 AM    4    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
   05:30:39 AM    5    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
   05:30:39 AM    6    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
   05:30:39 AM    7    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
   05:30:39 AM    8    0.00    0.00    2.13   85.11    0.00    0.00    1.06    0.00   11.70
   05:30:39 AM    9    1.03    0.00    3.09   87.63    0.00    0.00    2.06    0.00    6.19
   05:30:39 AM   10    2.06    0.00   13.40   74.23    0.00    0.00    8.25    0.00    2.06
   05:30:39 AM   11    3.00    0.00   26.00   45.00    0.00    0.00   20.00    0.00    6.00
   05:30:39 AM   12    3.06    0.00   26.53   42.86    0.00    0.00   20.41    0.00    7.14
   05:30:39 AM   13    4.04    0.00   32.32   24.24    0.00    0.00   26.26    0.00   13.13
   05:30:39 AM   14    4.12    0.00   31.96   26.80    0.00    0.00   28.87    0.00    8.25
   05:30:39 AM   15    2.02    0.00   26.26   46.46    0.00    0.00   25.25    0.00    0.00
   05:30:39 AM   16    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
   05:30:39 AM   17    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
   05:30:39 AM   18    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
   05:30:39 AM   19    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
   05:30:39 AM   20    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
   05:30:39 AM   21    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
   05:30:39 AM   22    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
   05:30:39 AM   23    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
   05:30:39 AM   24    2.00    0.00   14.00   22.00    0.00    0.00   15.00    0.00   47.00
   05:30:39 AM   25    0.00    0.00    4.17    4.17    0.00    0.00    4.17    0.00   87.50
   05:30:39 AM   26    0.00    0.00    5.21    8.33    0.00    0.00    5.21    0.00   81.25
   05:30:39 AM   27    1.96    0.00   17.65   58.82    0.00    0.00   21.57    0.00    0.00
   05:30:39 AM   28    1.04    0.00    9.38   81.25    0.00    0.00    8.33    0.00    0.00
   05:30:39 AM   29    5.00    0.00   34.00   18.00    0.00    0.00   35.00    0.00    8.00
   05:30:39 AM   30    3.92    0.00   32.35   19.61    0.00    0.00   35.29    0.00    8.82
   05:30:39 AM   31    0.00    0.00    0.00  100.00    0.00    0.00    0.00    0.00    0.00

   In this case we observe CPU lockups as well as IO timeouts. The CPU lockup is always on CPU-0, since none of the other CPUs are used by irqbalance to service these interrupts.

5. The above-mentioned irqbalance version with "-h ignore". This means the driver-provided affinity hint will _not_ be used by <irqbalance>, irrespective of the node locality of the PCI device. irqbalance will use its own algorithm to route the IRQs: it maps each MSI-X vector to _one_ core (in this case two CPU threads, since Hyper-Threading is enabled) from the _local_ NUMA node of the device. Any IO generated from the remote node will be re-routed to _one_ specific core of the local node, so it is not like the scenario above where only CPU-0 was loaded.
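   For reference, the per-vector values above can be cross-checked directly from /proc. A minimal sketch (it assumes the mpt3sas vectors are IRQs 120-151, as in the listing above, and that the kernel exposes the driver hints via affinity_hint):

   for irq in $(seq 120 151); do
       printf "irq %s: hint=%s smp_affinity=%s\n" "$irq" \
           "$(cat /proc/irq/$irq/affinity_hint)" \
           "$(cat /proc/irq/$irq/smp_affinity)"
   done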
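   Where the interrupts actually land per CPU in each of these scenarios can be watched from /proc/interrupts; a small sketch (the "mpt" match is an assumption about how the mpt3sas vectors are named on this setup):

   egrep 'CPU|mpt' /proc/interrupts
   # or watch the per-CPU counters change while the workload runs:
   watch -d -n 1 "egrep 'mpt' /proc/interrupts"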
6. The above-mentioned irqbalance version with "-h exact". This means the driver-provided affinity hint will be used by <irqbalance> for both the local and the remote node. Any IO generated from a node will be re-routed to _one_ specific CPU of the same node - in fact, the same logical CPU that was used for submission.

In our case, the <ignore> and <subset> policies do not work because <irqbalance> is designed to consider NUMA node locality. I read the article below, in which Neil Horman explained that the default policy of <irqbalance> will be moved to <ignore>:
http://sourceforge.net/p/e1000/bugs/394/

The <mpt3sas> driver follows the same logic as the ixgbe driver: we create multiple <msix> vectors depending on the number of logical CPUs and assign one <msix vector> to a single <logical cpu>. We really do not want those assignments to be NUMA-node biased.

What should the solution be if we really need to slow down IO submission to avoid the CPU lockup? We don't want only one CPU to be kept busy doing completions. Any suggestions?

Kashyap
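P.S. For reference, the manual fallback we would rather not have to script and maintain ourselves would look roughly like the sketch below. It assumes irqbalance is stopped (e.g. "service irqbalance stop") and that the mpt3sas vectors are IRQs 120-151, as in the listing above; it simply copies the driver-provided hint into the effective affinity, i.e. the "-h exact" behaviour done by hand:

   for irq in $(seq 120 151); do
       # apply the driver's affinity hint as the effective IRQ affinity
       cat /proc/irq/$irq/affinity_hint > /proc/irq/$irq/smp_affinity
   done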