[RFC 0/2] mpt3sas/megaraid_sas : irq poll and load balancing of reply queue

Kashyap Desai <kashyap.desai@xxxxxxxxxxxx> · Mon, 15 Jan 2018 17:42:05 +0530

Hi All -

We have seen CPU lockup issues in the field on systems with a large
logical CPU count (more than 96).
SAS3.0 controllers (Invader series) support at most 96 MSI-x vectors and
SAS3.5 products (Ventura) support at most 128 MSI-x vectors.

This may be a generic issue (for any PCI device that supports completion
on multiple reply queues). Let me explain it w.r.t. mpt3sas supported h/w
just to simplify the problem and the possible changes to handle such
issues. IT HBA (mpt3sas) supports multiple reply queues in the completion
path. The driver creates MSI-x vectors for the controller as "min of (FW
supported reply queues, logical CPUs)". If the submitter is not
interrupted via completion on the same CPU, a loop can form in the IO
path. This behavior can cause hard/soft CPU lockups, IO timeouts, system
sluggishness etc.

Example - one CPU (e.g. CPU A) is busy submitting IOs and another CPU
(e.g. CPU B) is busy processing the corresponding reply descriptors from
the reply descriptor queue upon receiving interrupts from the HBA. The
mpt3sas driver exits the ISR handler only when it finds an unused reply
descriptor in the reply descriptor queue. If CPU A is continuously pumping
IOs, CPU B (which is executing the ISR) may always see a valid reply
descriptor (posted by the HBA firmware after processing the IO) in the
queue, so it keeps processing reply descriptors in a loop without leaving
the ISR handler. In the worst case, the driver never quits this loop.
Eventually, a CPU lockup will be detected by the watchdog.
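
To make the loop concrete, here is a small user-space model of it in
Python. This is only a sketch and not driver code: the function name, the
`submit_every` knob and the `max_iterations` cap (a stand-in for the
watchdog) are all made up for illustration.

```python
from collections import deque


def isr_unbounded(reply_queue, submit_every, max_iterations):
    """Model an ISR that exits only when the reply queue is empty.

    reply_queue:    deque of valid reply descriptors already posted.
    submit_every:   one new descriptor is posted for every `submit_every`
                    descriptors processed (0 means the submitter is idle).
                    This models CPU A submitting while CPU B completes.
    max_iterations: hypothetical stand-in for the watchdog; the real loop
                    has no such cap.
    Returns (descriptors_processed, exited_on_its_own).
    """
    processed = 0
    while reply_queue:
        if processed >= max_iterations:
            return processed, False  # still looping: "lockup" detected
        reply_queue.popleft()
        processed += 1
        if submit_every and processed % submit_every == 0:
            reply_queue.append(processed)  # firmware posts another reply
    return processed, True


# Submitter keeps pace with the ISR: the loop never sees an empty queue.
stuck = isr_unbounded(deque(range(4)), submit_every=1, max_iterations=1000)
# Submitter is slower than the ISR: the queue drains and the ISR exits.
drained = isr_unbounded(deque(range(4)), submit_every=2, max_iterations=1000)
```

In this model the ISR exits on its own only when replies arrive more slowly
than they are processed; once the submitter keeps pace, the only thing that
ends the loop is the artificial cap.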

The above behavior is not common if "rq_affinity" is set to 2 or if the
affinity_hint is honored by irqbalance as "exact".
If rq_affinity is set to 2, the submitter will always be interrupted via
completion on the same CPU.
If irqbalance is using the "exact" policy, the interrupt will be delivered
to the submitter CPU.

Problem statement -
If the ratio of CPU count to MSI-x vector (reply descriptor queue) count
is not 1:1, we are still exposed to the issue explained above, and for
that case we don't have any solution.

Exposure to soft/hard lockups if the CPU count is more than the MSI-x
vectors supported by the device -

If the ratio of CPU count to MSI-x vector count is not 1:1 (in other
words, if it is something like X:1, where X > 1), then the 'exact'
irqbalance policy OR rq_affinity = 2 won't help to avoid CPU hard/soft
lockups. There is no one-to-one mapping between CPUs and MSI-x vectors.
Instead, one MSI-x interrupt (or reply descriptor queue) is shared by a
group/set of CPUs, so a loop can form in the IO path within that CPU group
and lockups may be observed.

For example: Consider a system with two NUMA nodes, each node having four
logical CPUs, and assume that the number of MSI-x vectors enabled on the
HBA is two. The ratio of CPU count to MSI-x vector count is then 4:1.
e.g.
MSI-x vector 0 has affinity to CPU 0, CPU 1, CPU 2 & CPU 3 of NUMA node 0
and MSI-x vector 1 has affinity to CPU 4, CPU 5, CPU 6 & CPU 7 of NUMA
node 1.

numactl --hardware
available: 2 nodes (0-1)
node 0 cpus: 0 1 2 3    --> MSI-x 0
node 0 size: 65536 MB
node 0 free: 63176 MB
node 1 cpus: 4 5 6 7    --> MSI-x 1
node 1 size: 65536 MB
node 1 free: 63176 MB
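
The vector count and the resulting CPU groups for this example can be
sketched as below. This is a simplified illustration: the function names
are hypothetical, and splitting the CPUs into contiguous groups is an
assumption made to match the example above, not a description of how the
kernel actually spreads affinity.

```python
def msix_layout(logical_cpus, fw_reply_queues):
    """Return (vector_count, cpu_groups) for a simplified affinity layout.

    vector_count follows the rule from the text:
        min(FW supported reply queues, logical CPUs).
    cpu_groups[i] lists the CPUs that share MSI-x vector i. CPUs are split
    into contiguous groups (an assumption for illustration only).
    """
    vectors = min(fw_reply_queues, logical_cpus)
    base, extra = divmod(logical_cpus, vectors)
    groups = []
    cpu = 0
    for v in range(vectors):
        size = base + (1 if v < extra else 0)
        groups.append(list(range(cpu, cpu + size)))
        cpu += size
    return vectors, groups


def shares_vectors(logical_cpus, fw_reply_queues):
    """True when the CPU:vector ratio is X:1 with X > 1."""
    return logical_cpus > fw_reply_queues


# The 8-CPU, 2-vector example from the text: ratio 4:1.
vectors, groups = msix_layout(logical_cpus=8, fw_reply_queues=2)
ratio_x = max(len(g) for g in groups)
```

For the example above this gives two vectors, with CPUs 0-3 sharing vector
0 and CPUs 4-7 sharing vector 1.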

Assume that the user starts an application which uses all the CPUs of
NUMA node 0 for issuing IOs.
Only one CPU from the affinity list, say CPU 0 (it can be any CPU since
this depends on irqbalance), will receive the interrupts from MSI-x vector
0 for all the IOs. Over time, the IO submission percentage on CPU 0 will
decrease and its ISR processing percentage will increase, as it gets
busier processing interrupts. Gradually, the IO submission percentage on
CPU 0 will drop to zero and its ISR processing percentage will reach 100
percent, because an IO loop has formed within NUMA node 0: CPU 1, CPU 2 &
CPU 3 are continuously busy submitting heavy IO and only CPU 0 is busy in
the ISR path, as it always finds a valid reply descriptor in the reply
descriptor queue. Eventually, we will observe a hard lockup here.

The chance of hard/soft lockups grows with the value of X. The higher the
value of X, the higher the chance of observing CPU lockups.

Solution -
Fix-1 - Use the IRQ poll interface defined in "irq_poll.c". The mpt3sas
driver will execute the ISR routine in softirq context and it will always
quit the loop based on the budget provided by the IRQ poll interface.
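
The effect of the budget can be modelled in user space as below. This is
only a sketch of the idea (stop after `budget` descriptors even if more
are pending), not the kernel irq_poll API; the function names and the
numbers are made up for illustration.

```python
from collections import deque


def poll_reply_queue(reply_queue, budget):
    """Process at most `budget` reply descriptors, then return.

    Returns the number of descriptors processed. A return value below
    `budget` means the queue was drained before the budget ran out.
    """
    done = 0
    while reply_queue and done < budget:
        reply_queue.popleft()
        done += 1
    return done


def run_polls(reply_queue, budget, submit_per_poll, polls):
    """Run `polls` poll passes while a submitter keeps adding replies.

    submit_per_poll new descriptors are posted before each pass, which
    models other CPUs pumping IO. Returns the work done in each pass.
    """
    work = []
    for _ in range(polls):
        reply_queue.extend(range(submit_per_poll))
        work.append(poll_reply_queue(reply_queue, budget))
    return work


# Heavy load: replies arrive faster than the budget, yet every pass ends.
heavy = run_polls(deque(range(100)), budget=32, submit_per_poll=64, polls=3)
# Light load: the queue drains and later passes have nothing to do.
light = run_polls(deque(range(5)), budget=32, submit_per_poll=0, polls=2)
```

Even when replies arrive faster than they are consumed, each pass returns
after `budget` descriptors, so the CPU gets a chance to do other work
between passes instead of spinning in one handler.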

In these scenarios (i.e. where the ratio of CPU count to MSI-x vector
count is X:1, with X > 1), the IRQ poll interface will avoid CPU hard
lockups because of the voluntary exit from reply queue processing based on
the budget. Note - only one MSI-x vector is busy doing processing. Irqstat
output -

IRQs / 1 second(s)
IRQ#  TOTAL  NODE0   NODE1   NODE2   NODE3  NAME
  44  122871  122871      0       0       0  IR-PCI-MSI-edge mpt3sas0-msix0
  45       0       0      0       0       0  IR-PCI-MSI-edge mpt3sas0-msix1

Fix-2 - The above fix will avoid lockups, but there can be a performance
issue if only a few reply queues are busy. The driver should round-robin
across the reply queues, so that the load is balanced over all of them.
Irqstat output after the driver does reply queue load balancing -

IRQs / 1 second(s)
IRQ#  TOTAL  NODE0   NODE1   NODE2   NODE3  NAME
  44  62871  62871       0       0       0  IR-PCI-MSI-edge mpt3sas0-msix0
  45  62718  62718       0       0       0  IR-PCI-MSI-edge mpt3sas0-msix1
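
The round robin selection can be sketched as below. This is an
illustrative user-space model only: the class and method names are
hypothetical, and a real driver would need a counter that is safe to
update from multiple CPUs.

```python
from itertools import count


class ReplyQueueBalancer:
    """Pick the reply queue for each submitted IO in round robin order."""

    def __init__(self, num_queues):
        self.num_queues = num_queues
        self._counter = count()  # stand-in for a shared per-HBA counter

    def next_queue(self):
        """Return the reply queue index to use for the next IO."""
        return next(self._counter) % self.num_queues


# Spread 1000 IOs over 2 reply queues and count the hits per queue.
balancer = ReplyQueueBalancer(num_queues=2)
hits = [0, 0]
for _ in range(1000):
    hits[balancer.next_queue()] += 1
```

With two queues every second IO lands on the same queue, so the completion
load splits evenly, which is the kind of balance the irqstat output above
shows.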

In Summary,
A CPU that completes IO without contributing to IO submission may cause a
CPU lockup.
If the ratio of CPU count to MSI-x vector count is X:1 (where X > 1), then
using the irq poll interface we can avoid CPU lockups, and by distributing
the interrupts equally among the enabled MSI-x vectors we can avoid
performance issues.

We are planning to use both fixes only if the CPU count is more than the
number of MSI-x vectors supported by the FW.
Please review and provide your feedback. I have appended both patches.

Thanks, Kashyap