Re: [v1 0/7] Irq poll to address cpu lockup.

Hi Martin,

Any update on these patches?

Thanks,
Suganath


On Fri, Feb 15, 2019 at 1:10 PM Suganath Prabu
<suganath-prabu.subramani@xxxxxxxxxxxx> wrote:
>
> We have seen CPU lockup issues in the field when a system
> has a high logical CPU count (more than 96). SAS 3.0
> controllers (Invader series) support at most 96 MSI-x
> vectors, and SAS 3.5 controllers (Ventura) support at most
> 128 MSI-x vectors.
>
> This may be a generic issue (any PCI device that supports
> completion on multiple reply queues is exposed).
> Let me explain it w.r.t. the hardware supported by mpt3sas,
> just to simplify the problem and the possible changes to
> handle such issues. IT HBAs (mpt3sas) support multiple
> reply queues in the completion path. The driver creates
> MSI-x vectors for the controller as "min of (FW supported
> reply queues, logical CPUs)". If the submitter is not
> interrupted via a completion on the same CPU, there is a
> loop in the IO path. This behavior can cause hard/soft CPU
> lockups, IO timeouts, system sluggishness, etc.
>
> Example - one CPU (e.g. CPU A) is busy submitting IOs and
> another CPU (e.g. CPU B) is busy processing the
> corresponding IOs' reply descriptors from the reply
> descriptor queue upon receiving interrupts from the HBA.
> If CPU A keeps pumping IOs, then CPU B (which is executing
> the ISR) will always see valid reply descriptors in the
> reply descriptor queue and will keep processing them in a
> loop without quitting the ISR handler.
>
> The mpt3sas driver exits the ISR handler only when it finds
> an unused reply descriptor in the reply descriptor queue.
> Since CPU A is continuously sending IOs, CPU B may always
> see a valid reply descriptor (posted by the HBA firmware
> after processing an IO) in the reply descriptor queue. In
> the worst case, the driver never quits this loop in the ISR
> handler, and eventually a CPU lockup is detected by the
> watchdog; a minimal sketch of this loop follows.
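>
> For illustration, here is a rough sketch of this unbounded
> loop. struct reply_queue and the helpers
> reply_descriptor_valid() / process_reply_descriptor() are
> hypothetical stand-ins, not the actual mpt3sas code:
>
> #include <linux/interrupt.h>
>
> static irqreturn_t reply_queue_isr(int irq, void *data)
> {
> 	struct reply_queue *rq = data;
>
> 	/*
> 	 * Loop until an unused descriptor is found. If the
> 	 * submitter keeps the queue populated, this loop never
> 	 * terminates and the watchdog eventually flags a CPU
> 	 * lockup.
> 	 */
> 	while (reply_descriptor_valid(rq))
> 		process_reply_descriptor(rq);
>
> 	return IRQ_HANDLED;
> }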
>
> The above behavior is not common if "rq_affinity" is set
> to 2 or the affinity_hint is honored by irqbalance with the
> "exact" policy.
> If rq_affinity is set to 2, the submitter is always
> interrupted via a completion on the same CPU.
> If irqbalance uses the "exact" policy, the interrupt is
> delivered to the submitter's CPU.
>
> Problem statement -
> If the CPU count to MSI-x vector (reply descriptor queue)
> count ratio is not 1:1, we are still exposed to the issue
> explained above, and for that we have had no solution so
> far: soft/hard lockups can occur whenever the CPU count is
> greater than the number of MSI-x vectors supported by the
> device.
>
> If the CPU count to MSI-x vector count ratio is not 1:1
> (in other words, if the ratio is X:1, where X > 1), then
> the 'exact' irqbalance policy or rq_affinity = 2 won't help
> to avoid CPU hard/soft lockups. There is no longer a
> one-to-one mapping between CPUs and MSI-x vectors; instead,
> one MSI-x interrupt (or reply descriptor queue) is shared
> by a group/set of CPUs, so a loop can form in the IO path
> within that CPU group and lockups may be observed.
>
> For example, consider a system with two NUMA nodes, each
> node having four logical CPUs, and two MSI-x vectors
> enabled on the HBA; the CPU count to MSI-x vector count
> ratio is then 4:1.
> e.g.
> MSI-x vector 0 has affinity to CPU 0, CPU 1, CPU 2 & CPU 3
> of NUMA node 0, and MSI-x vector 1 has affinity to CPU 4,
> CPU 5, CPU 6 & CPU 7 of NUMA node 1.
>
> numactl --hardware
> available: 2 nodes (0-1)
> node 0 cpus: 0 1 2 3                 --> MSI-x 0
> node 0 size: 65536 MB
> node 0 free: 63176 MB
> node 1 cpus: 4 5 6 7                 --> MSI-x 1
> node 1 size: 65536 MB
> node 1 free: 63176 MB
>
> Assume that a user starts an application which uses all the
> CPUs of NUMA node 0 for issuing IOs. Only one CPU from the
> affinity list (it can be any CPU, since this depends upon
> irqbalance), say CPU 0, will receive the interrupts from
> MSI-x vector 0 for all the IOs. Gradually, CPU 0's IO
> submission share decreases and its ISR processing share
> increases, as it becomes busier with processing interrupts.
> Eventually CPU 0's IO submission drops to zero and its ISR
> processing reaches 100 percent, because an IO loop has
> formed within NUMA node 0: CPU 1, CPU 2 & CPU 3 are
> continuously busy submitting heavy IOs, while CPU 0 alone
> is stuck in the ISR path, always finding a valid reply
> descriptor in the reply descriptor queue. Eventually, we
> will observe a hard lockup here.
>
> The chances of hard/soft lockups are directly proportional
> to the value of X: the higher the value of X, the higher
> the chance of observing CPU lockups.
>
> Solution -
>
> Fix-1
> =====
> Use the IRQ poll interface defined in "irq_poll.c". The
> mpt3sas driver will execute its ISR routine in softirq
> context and will always quit the loop based on the budget
> provided by the IRQ poll interface.
>
> In these scenarios (i.e. where the CPU count to MSI-x
> vector count ratio is X:1, with X > 1), the IRQ poll
> interface avoids CPU hard lockups by voluntarily exiting
> the reply queue processing once the budget is consumed
> (see the sketch after the irqstat output below).
> Note - only one MSI-x vector is busy doing the processing.
>
> Irqstat output -
>
> IRQs / 1 second(s)
> IRQ#    TOTAL   NODE0  NODE1  NODE2  NODE3  NAME
>   44   122871  122871      0      0      0  IR-PCI-MSI-edge mpt3sas0-msix0
>   45        0       0      0      0      0  IR-PCI-MSI-edge mpt3sas0-msix1
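>
> As a rough sketch of the irq_poll usage (struct reply_queue,
> its irqpoll member, the budget value and the reply
> processing helpers are hypothetical; irq_poll_init(),
> irq_poll_sched() and irq_poll_complete() are the existing
> kernel API from include/linux/irq_poll.h):
>
> #include <linux/interrupt.h>
> #include <linux/irq_poll.h>
>
> #define REPLY_POLL_BUDGET	64	/* assumed budget value */
>
> /* Hard-IRQ handler: hand the queue over to softirq polling. */
> static irqreturn_t reply_queue_isr(int irq, void *data)
> {
> 	struct reply_queue *rq = data;
>
> 	irq_poll_sched(&rq->irqpoll);
> 	return IRQ_HANDLED;
> }
>
> /*
>  * Softirq poll callback: process at most 'budget' reply
>  * descriptors, then voluntarily yield the CPU.
>  */
> static int reply_queue_poll(struct irq_poll *iop, int budget)
> {
> 	struct reply_queue *rq =
> 		container_of(iop, struct reply_queue, irqpoll);
> 	int done = 0;
>
> 	while (done < budget && reply_descriptor_valid(rq)) {
> 		process_reply_descriptor(rq);
> 		done++;
> 	}
>
> 	if (done < budget)
> 		irq_poll_complete(iop);	/* queue drained */
>
> 	return done;
> }
>
> /* At reply queue setup time: */
> /* irq_poll_init(&rq->irqpoll, REPLY_POLL_BUDGET, reply_queue_poll); */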
>
> Fix-2
> =====
> The driver should round-robin across the reply queues so
> that each reply queue is load balanced, i.e. IOs are
> distributed equally among all the available reply
> descriptor post queues. With this, the load on each reply
> descriptor post queue is balanced. This improves
> performance and also fixes soft lockups (see the sketch
> after the irqstat output below).
>
> Irqstat output after the driver load balances the reply
> queues -
>
> IRQs / 1 second(s)
> IRQ#    TOTAL   NODE0  NODE1  NODE2  NODE3  NAME
>   44    62871   62871      0      0      0  IR-PCI-MSI-edge mpt3sas0-msix0
>   45    62718   62718      0      0      0  IR-PCI-MSI-edge mpt3sas0-msix1
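>
> A minimal sketch of such a round-robin selection (struct
> controller and its io_cnt / reply_queue_count fields are
> hypothetical names, not the actual mpt3sas structures):
>
> #include <linux/atomic.h>
> #include <linux/types.h>
>
> /*
>  * Pick the MSI-x index (reply descriptor post queue) for each
>  * outgoing IO so that completions are spread evenly across all
>  * enabled queues.
>  */
> static u8 get_reply_queue_index(struct controller *ioc)
> {
> 	u32 cnt = (u32)atomic_inc_return(&ioc->io_cnt);
>
> 	return cnt % ioc->reply_queue_count;
> }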
>
> In summary, a CPU that completes IOs while not contributing
> to IO submission may cause a CPU lockup. If the CPU count
> to MSI-x vector count ratio is X:1 (where X > 1), then
> using the IRQ poll interface we can avoid the CPU lockups,
> and by distributing the interrupts equally among the
> enabled MSI-x vectors we can avoid the performance issues.
>
> Patches 4 & 5 implement Fix 1 and Fix 2 explained above,
> and only take effect if the CPU count is more than the FW
> supported MSI-x vector count.
>
> V1 changeset:
> Added patch 3 to select IRQ_POLL (Kconfig).
>
> Suganath Prabu (7):
>   mpt3sas: Fix typo in request_desript_type.
>   mpt3sas: simplify interrupt handler.
>   mpt3sas: Select IRQ_POLL to avoid build error.
>   mpt3sas: Irq poll to avoid CPU hard lockups.
>   mpt3sas: Load balance to improve performance and avoid soft lockups.
>   mpt3sas: Improve the threshold value and introduce module param.
>   mpt3sas: Update mpt3sas driver version to 28.100.00.00
>
>  drivers/scsi/mpt3sas/Kconfig        |   1 +
>  drivers/scsi/mpt3sas/mpt3sas_base.c | 178 ++++++++++++++++++++++++++++++------
>  drivers/scsi/mpt3sas/mpt3sas_base.h |  22 ++++-
>  3 files changed, 170 insertions(+), 31 deletions(-)
>
> --
> 1.8.3.1
>


