Hi Martin,

Any update on these patches?

Thanks,
Suganath

On Fri, Feb 15, 2019 at 1:10 PM Suganath Prabu
<suganath-prabu.subramani@xxxxxxxxxxxx> wrote:
>
> We have seen CPU lockup issues in the field on
> systems with a high (more than 96) logical CPU count.
> The SAS3.0 controller (Invader series) supports at
> most 96 MSI-X vectors and the SAS3.5 product (Ventura)
> supports at most 128 MSI-X vectors.
>
> This may be a generic issue (if a PCI device supports
> completion on multiple reply queues).
> Let me explain it w.r.t. mpt3sas supported h/w
> just to simplify the problem and the possible changes to
> handle such issues. The IT HBA (mpt3sas) supports
> multiple reply queues in the completion path. The driver
> creates MSI-X vectors for the controller as "min of
> (FW supported reply queues, logical CPUs)". If the submitter
> is not interrupted via completion on the same CPU, there is
> a loop in the IO path. This behavior can cause
> hard/soft CPU lockups, IO timeouts, system sluggishness, etc.
>
> Example - one CPU (e.g. CPU A) is busy submitting IOs
> and another CPU (e.g. CPU B) is busy processing the
> corresponding IOs' reply descriptors from the reply
> descriptor queue upon receiving interrupts from the HBA.
> If CPU A is continuously pumping IOs, then
> CPU B (which is executing the ISR) will always see valid
> reply descriptors in the reply descriptor queue and it
> will keep processing those reply descriptors
> in a loop without quitting the ISR handler.
>
> The mpt3sas driver exits the ISR handler when it finds an unused
> reply descriptor in the reply descriptor queue. Since
> CPU A will be continuously sending IOs, CPU B may
> always see a valid reply descriptor
> (posted by the HBA firmware after processing the IO) in the
> reply descriptor queue. In the worst case, the driver will not
> quit this loop in the ISR handler. Eventually, the
> CPU lockup will be detected by the watchdog.
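
The loop described above can be illustrated with a small user-space toy model. This is only a sketch under stated assumptions, not the mpt3sas code: the names `toy_msix_vectors` and `toy_isr_drain`, the `refill` rate, and the `cap` that stands in for the watchdog are all hypothetical and chosen for illustration.

```c
#include <assert.h>

/* Hypothetical helper: vector count as "min of (FW supported reply
 * queues, logical CPUs)", as described in the text. */
unsigned int toy_msix_vectors(unsigned int fw_reply_queues,
                              unsigned int logical_cpus)
{
    return fw_reply_queues < logical_cpus ? fw_reply_queues : logical_cpus;
}

/* Toy model of a handler that only exits when it finds an unused
 * reply descriptor (pending == 0).
 *
 * pending: valid reply descriptors present when the handler starts
 * refill:  new descriptors posted for every descriptor processed
 *          (models the submitting CPUs; an assumption of this sketch)
 * cap:     iteration limit standing in for the watchdog
 *
 * Returns the number of descriptors processed; a return value equal
 * to cap means the handler never saw an unused descriptor. */
unsigned int toy_isr_drain(unsigned int pending, unsigned int refill,
                           unsigned int cap)
{
    unsigned int processed = 0;

    assert(cap > 0);
    while (pending > 0) {
        pending--;            /* consume one reply descriptor */
        processed++;
        pending += refill;    /* submitters keep posting new work */
        if (processed >= cap) /* the watchdog would fire here */
            break;
    }
    return processed;
}
```

With `refill` set to 0 the handler drains the queue and returns; with `refill` of 1 or more the queue never empties and the handler runs until the cap, which is the situation CPU B is in when CPU A keeps submitting.
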
>
> The behavior described above is not common if "rq_affinity" is
> set to 2 or if the affinity_hint is honored by
> irqbalance as "exact".
> If rq_affinity is set to 2, the submitter will always be
> interrupted via completion on the same CPU.
> If irqbalance is using the "exact" policy, the
> interrupt will be delivered to the submitter CPU.
>
> Problem statement -
> If the ratio of CPU count to MSI-X vector (reply descriptor queue)
> count is not 1:1, we are still exposed to the issue
> explained above, and for that we don't have any solution.
>
> There is exposure to soft/hard lockups if the CPU count is more
> than the number of MSI-X vectors supported by the device.
>
> If the ratio of CPU count to MSI-X vector count is not 1:1
> (in other words, if it is
> something like X:1, where X > 1), then the 'exact' irqbalance
> policy or rq_affinity = 2 won't help to avoid CPU
> hard/soft lockups. There won't be a one-to-one mapping
> between CPUs and MSI-X vectors; instead, one MSI-X interrupt
> (or reply descriptor queue) is shared by a group/set of
> CPUs, so a loop can form in the
> IO path within that CPU group and lockups may be observed.
>
> For example: consider a system with two NUMA nodes,
> each node having four logical CPUs, and also consider that
> the number of MSI-X vectors enabled on the HBA is two. Then the
> CPU count to MSI-X vector count ratio is 4:1,
> e.g.
> MSI-X vector 0 has affinity to CPU 0, CPU 1, CPU 2 & CPU 3
> of NUMA node 0 and MSI-X vector 1 has affinity to CPU 4,
> CPU 5, CPU 6 & CPU 7 of NUMA node 1.
>
> numactl --hardware
> available: 2 nodes (0-1)
> node 0 cpus: 0 1 2 3 --> MSI-X 0
> node 0 size: 65536 MB
> node 0 free: 63176 MB
> node 1 cpus: 4 5 6 7 --> MSI-X 1
> node 1 size: 65536 MB
> node 1 free: 63176 MB
>
> Assume that the user started an application which uses
> all the CPUs of NUMA node 0 for issuing IOs.
> Only one CPU from the affinity list (it can be any CPU, since
> this behavior depends upon irqbalance), say CPU 0, will receive the
> interrupts from MSI-X vector 0 for all the IOs. Eventually, the
> IO submission percentage on CPU 0 will decrease and its ISR
> processing percentage will increase as it gets busier
> processing the interrupts.
> Gradually the IO submission percentage on CPU 0 will drop to zero
> and its ISR processing percentage will reach 100 percent, as an
> IO loop has already formed within NUMA node 0,
> i.e. CPU 1, CPU 2 & CPU 3 will be continuously busy
> submitting heavy IOs and only CPU 0 is busy in the ISR
> path, as it always finds a valid reply descriptor in the
> reply descriptor queue. Eventually, we will observe a
> hard lockup here.
>
> The chance of hard/soft lockups occurring increases with
> the value of X. If the value of X is high,
> then the chance of observing CPU lockups is high.
>
> Solution -
>
> Fix-1
> =====
> Use the IRQ poll interface defined in "irq_poll.c".
> The mpt3sas driver will execute the ISR routine in softirq context
> and it will always quit the loop based on the budget provided by the
> IRQ poll interface.
>
> In these scenarios (i.e. where the CPU count to MSI-X vector
> count ratio is X:1, where X > 1), the IRQ poll interface
> will avoid CPU hard lockups, because the driver voluntarily exits
> reply queue processing based on the budget.
> Note - Only one MSI-X vector is busy doing processing.
>
> Irqstat output -
>
> IRQs / 1 second(s)
> IRQ#  TOTAL   NODE0   NODE1  NODE2  NODE3  NAME
>   44  122871  122871      0      0      0  IR-PCI-MSI-edge mpt3sas0-msix0
>   45       0       0      0      0      0  IR-PCI-MSI-edge mpt3sas0-msix1
>
> Fix-2
> =====
> The driver should round robin the reply queues, so that
> IOs are distributed equally across all the available
> reply descriptor post queues.
> With this, the load on each reply descriptor post queue is balanced.
> This improves performance and also fixes soft lockups.
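
The two fixes can be sketched in plain user-space C. This is a toy model under stated assumptions, not the code from the patches: `toy_reply_queue`, `toy_poll`, `toy_rr` and `toy_pick_queue` are hypothetical names, and the sketch does not call the kernel's irq_poll interface. It only shows the two ideas: stop after a fixed budget even if work remains, and rotate through the reply queues.

```c
#include <assert.h>

/* Toy reply queue: counts valid reply descriptors still waiting and
 * those already handled. */
struct toy_reply_queue {
    unsigned int pending;
    unsigned int processed;
};

/* Fix-1 idea: process at most 'budget' descriptors, then return
 * voluntarily even if more are pending. Returns the work done so the
 * caller can decide whether to poll again later. */
unsigned int toy_poll(struct toy_reply_queue *rq, unsigned int budget)
{
    unsigned int done = 0;

    while (done < budget && rq->pending > 0) {
        rq->pending--;
        rq->processed++;
        done++;
    }
    return done;
}

/* Fix-2 idea: round robin state over the available reply queues. */
struct toy_rr {
    unsigned int next;     /* index handed out on the next call */
    unsigned int nqueues;  /* number of reply queues, must be > 0 */
};

/* Return the reply queue index for the next IO and advance the
 * cursor, wrapping around at nqueues. */
unsigned int toy_pick_queue(struct toy_rr *rr)
{
    unsigned int q;

    assert(rr->nqueues > 0);
    q = rr->next;
    rr->next = (rr->next + 1) % rr->nqueues;
    return q;
}
```

For example, `toy_poll` with a budget of 32 on a queue holding 100 pending descriptors handles 32 and leaves 68 for a later pass, and with two queues successive `toy_pick_queue` calls return 0, 1, 0, 1, so each queue receives half of the IOs.
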
>
> Irqstat output after the driver does reply queue load balancing -
>
> IRQs / 1 second(s)
> IRQ#  TOTAL  NODE0  NODE1  NODE2  NODE3  NAME
>   44  62871  62871      0      0      0  IR-PCI-MSI-edge mpt3sas0-msix0
>   45  62718  62718      0      0      0  IR-PCI-MSI-edge mpt3sas0-msix1
>
> In summary,
> a CPU that completes IOs but does not contribute to
> IO submission may cause a CPU lockup.
> If the CPU count to MSI-X vector count ratio is X:1 (where X > 1),
> then by using the irq poll interface we can avoid the CPU lockups, and
> by equally distributing the interrupts among the enabled MSI-X
> interrupts we can avoid performance issues.
>
> Patches 3 & 4 address Fix 1 and Fix 2 explained
> above, only if the CPU count is more than the number of FW supported
> MSI-X vectors.
>
> Patch V1 changeset.
> Added patch 3 to add select irqpoll (Kconfig).
>
> Suganath Prabu (7):
>   mpt3sas: Fix typo in request_desript_type.
>   mpt3sas: simplify interrupt handler.
>   mpt3sas: Select IRQ_POLL to avoid build error.
>   mpt3sas: Irq poll to avoid CPU hard lockups.
>   mpt3sas: Load balance to improve performance and avoid soft lockups.
>   mpt3sas: Improve the threshold value and introduce module param.
>   mpt3sas: Update mpt3sas driver version to 28.100.00.00
>
>  drivers/scsi/mpt3sas/Kconfig        |   1 +
>  drivers/scsi/mpt3sas/mpt3sas_base.c | 178 ++++++++++++++++++++++++++++++------
>  drivers/scsi/mpt3sas/mpt3sas_base.h |  22 ++++-
>  3 files changed, 170 insertions(+), 31 deletions(-)
>
> --
> 1.8.3.1
>