> -----Original Message-----
> From: Ming Lei [mailto:ming.lei@xxxxxxxxxx]
> Sent: Friday, February 2, 2018 3:44 PM
> To: Kashyap Desai
> Cc: linux-scsi@xxxxxxxxxxxxxxx; Peter Rivera
> Subject: Re: [RFC 0/2] mpt3sas/megaraid_sas : irq poll and load balancing
> of reply queue
>
> Hi Kashyap,
>
> On Mon, Jan 15, 2018 at 05:42:05PM +0530, Kashyap Desai wrote:
> > Hi All -
> >
> > We have seen CPU lockup issues in the field if the system has a large
> > (more than 96) logical CPU count.
> > SAS3.0 controllers (Invader series) support at most 96 MSI-x vectors,
> > and SAS3.5 products (Ventura) support at most 128 MSI-x vectors.
> >
> > This may be a generic issue (if the PCI device supports completion on
> > multiple reply queues). Let me explain it w.r.t. the mpt3sas-supported
> > h/w just to simplify the problem and the possible changes to handle
> > such issues. The IT HBA (mpt3sas) supports multiple reply queues in
> > the completion path. The driver creates MSI-x vectors for the
> > controller as "min of (FW-supported reply queues, logical CPUs)". If
> > the submitter is not interrupted via completion on the same CPU, there
> > is a loop in the IO path. This behavior can cause hard/soft CPU
> > lockups, IO timeouts, system sluggishness, etc.
>
> As I mentioned in another thread, this issue may be solved by SCSI_MQ by
> mapping reply queues into the hctx of blk_mq, together with
> QUEUE_FLAG_SAME_FORCE. Especially since you have already set
> 'smp_affinity_enable' to 1 by default,
> pci_alloc_irq_vectors(PCI_IRQ_AFFINITY) can spread the IRQ vectors
> across CPUs perfectly for you.
>
> But the following patch from Hannes is required for the conversion:
>
> https://marc.info/?l=linux-block&m=149130770004507&w=2

Hi Ming -

I have gone through the thread discussing "support host-wide tagset". The
link below has the latest reply on that thread:

https://marc.info/?l=linux-block&m=149132580511346&w=2

I think there is some confusion over the mpt3sas and megaraid_sas h/w
behavior.
Broadcom/LSI HBA and MR h/w have only one h/w queue for submission, but
there are multiple reply queues. Even if I include Hannes' patch for the
host-wide tagset, the problem described in this RFC will not be resolved.
In fact, the tagset approach can show the same symptoms if the number of
completion queues is less than the number of online CPUs. Don't you
think? Or am I missing something?

We don't have a problem in the submission path. The current problem is
that an MSI-x vector affined to more than one CPU can cause an I/O loop.
This is visible if we have a higher number of online CPUs.

> > Example - one CPU (e.g. CPU A) is busy submitting the IOs and another
> > CPU (e.g. CPU B) is busy processing the corresponding IO reply
> > descriptors from the reply descriptor queue upon receiving the
> > interrupts from the HBA. If CPU A is continuously pumping the IOs,
> > then CPU B (which is executing the ISR) will always see valid reply
> > descriptors in the reply descriptor queue and will continuously
> > process those reply descriptors in a loop without quitting the ISR
> > handler. The mpt3sas driver exits the ISR handler only when it finds
> > an unused reply descriptor in the reply descriptor queue. Since CPU A
> > is continuously sending IOs, CPU B may always see a valid reply
> > descriptor (posted by the HBA firmware after processing the IO) in
> > the reply descriptor queue. In the worst case, the driver will not
> > quit this loop in the ISR handler. Eventually, a CPU lockup will be
> > detected by the watchdog.
> >
> > The above-mentioned behavior is not common if "rq_affinity" is set to
> > 2 or the affinity_hint is honored by irqbalance as "exact".
> > If rq_affinity is set to 2, the submitter will always be interrupted
> > via completion on the same CPU.
> > If irqbalance is using the "exact" policy, the interrupt will be
> > delivered to the submitter CPU.
>
> Now that you use pci_alloc_irq_vectors(PCI_IRQ_AFFINITY) to get the
> MSI-x vector count, the IRQ affinity can't be changed from userspace
> any more.
>
> >
> > Problem statement -
> > If the ratio of CPU count to MSI-X vector (reply descriptor queue)
> > count is not 1:1, we still have exposure to the issue explained
> > above, and for that we don't have any solution.
> >
> > There is exposure to soft/hard lockups if the CPU count is more than
> > the MSI-x vectors supported by the device.
> >
> > If the CPU-to-MSI-x-vector ratio is not 1:1 (in other words, if the
> > ratio is X:1, where X > 1), then the 'exact' irqbalance policy or
> > rq_affinity = 2 won't help to avoid CPU hard/soft lockups. There
> > won't be a one-to-one mapping between CPUs and MSI-x vectors;
> > instead, one MSI-x interrupt (or reply descriptor queue) is shared by
> > a group/set of CPUs, and a loop can form in the IO path within that
> > CPU group, where we may observe lockups.
> >
> > For example: consider a system having two NUMA nodes, each node
> > having four logical CPUs, and also consider that the number of MSI-x
> > vectors enabled on the HBA is two; then the CPU-to-MSI-x-vector ratio
> > is 4:1.
> > e.g.
> > MSI-x vector 0 has affinity to CPU 0, CPU 1, CPU 2 & CPU 3 of NUMA
> > node 0, and MSI-x vector 1 has affinity to CPU 4, CPU 5, CPU 6 &
> > CPU 7 of NUMA node 1.
> >
> > numactl --hardware
> > available: 2 nodes (0-1)
> > node 0 cpus: 0 1 2 3    --> MSI-x 0
> > node 0 size: 65536 MB
> > node 0 free: 63176 MB
> > node 1 cpus: 4 5 6 7    --> MSI-x 1
> > node 1 size: 65536 MB
> > node 1 free: 63176 MB
> >
> > Assume that a user started an application which uses all the CPUs of
> > NUMA node 0 for issuing the IOs.
> > Only one CPU from the affinity list (it can be any CPU, since this
> > behavior depends upon irqbalance), say CPU 0, will receive the
> > interrupts from MSI-x vector 0 for all the IOs. Eventually, CPU 0's
> > IO submission percentage will decrease and its ISR processing
> > percentage will increase, as it is busier processing the interrupts.
> > Gradually, the IO submission percentage on CPU 0 will drop to zero
> > and its ISR processing percentage will reach 100 percent, as an IO
> > loop has already formed within NUMA node 0: CPU 1, CPU 2 & CPU 3 are
> > continuously busy submitting heavy IOs, while CPU 0 alone is busy in
> > the ISR path, as it always finds a valid reply descriptor in the
> > reply descriptor queue. Eventually, we will observe a hard lockup
> > here.
> >
> > The chances of hard/soft lockups occurring are directly proportional
> > to the value of X. If the value of X is high, then the chance of
> > observing CPU lockups is high.
> >
> > Solution -
> > Fix 1 - Use the IRQ poll interface defined in "irq_poll.c". The
> > mpt3sas driver will execute the ISR routine in softirq context, and
> > it will always quit the loop based on the budget provided by the IRQ
> > poll interface.
> >
> > In these scenarios (i.e. where the CPU-to-MSI-X-vector ratio is X:1,
> > where X > 1), the IRQ poll interface will avoid CPU hard lockups by
> > voluntarily exiting the reply queue processing based on the budget.
> > Note - only one MSI-x vector is busy doing processing. irqstat
> > output -
> >
> >                 IRQs / 1 second(s)
> > IRQ#  TOTAL   NODE0   NODE1  NODE2  NODE3  NAME
> >   44  122871  122871      0      0      0  IR-PCI-MSI-edge mpt3sas0-msix0
> >   45       0       0      0      0      0  IR-PCI-MSI-edge mpt3sas0-msix1
> >
> > Fix 2 - The above fix will avoid lockups, but there can be a
> > performance issue if only a few reply queues are busy. The driver
> > should round-robin the reply queues, so that each reply queue is
> > load balanced. irqstat output after the driver does reply queue load
> > balancing -
> >
> >                 IRQs / 1 second(s)
> > IRQ#  TOTAL  NODE0  NODE1  NODE2  NODE3  NAME
> >   44  62871  62871      0      0      0  IR-PCI-MSI-edge mpt3sas0-msix0
> >   45  62718  62718      0      0      0  IR-PCI-MSI-edge mpt3sas0-msix1
> >
> > In summary,
> > a CPU completing IO which is not contributing to IO submission may
> > cause a CPU lockup.
> > If the CPU-to-MSI-X-vector ratio is X:1 (where X > 1), then by using
> > the IRQ poll interface we can avoid the CPU lockups, and by equally
> > distributing the interrupts among the enabled MSI-x interrupts we
> > can avoid the performance issues.
> >
> > We are planning to use both fixes only if the CPU count is more than
> > the FW-supported MSI-x vector count.
> > Please review and provide your feedback. I have appended both
> > patches.
>
> Please take a look at pci_alloc_irq_vectors(PCI_IRQ_AFFINITY) and
> SCSI_MQ/blk_mq; your issue can be solved without much difficulty.
>
> One annoying thing is that a SCSI driver has to support both the MQ and
> non-MQ paths. A long time ago, I submitted a patch to support force-MQ
> in a driver, but it was rejected.
>
> Thanks,
> Ming