Re: [PATCH 0/5] blk-mq/scsi-mq: support global tags & introduce force_blk_mq

Hi Kashyap,

On Tue, Feb 06, 2018 at 11:33:50AM +0530, Kashyap Desai wrote:
> > > We still have more than one reply queue whose completions end up on
> > > the same CPU.
> >
> > pci_alloc_irq_vectors(PCI_IRQ_AFFINITY) has to be used, which means
> > smp_affinity_enable has to be set to 1, but that seems to be the default
> > setting.
> >
> > Please see kernel/irq/affinity.c, especially irq_calc_affinity_vectors(),
> > which figures out an optimal number of vectors; the computation is now
> > based on cpumask_weight(cpu_possible_mask). If all offline CPUs are
> > mapped to some of the reply queues, those queues won't be active (no
> > requests are submitted to them). The mechanism of PCI_IRQ_AFFINITY
> > basically makes sure that more than one irq vector won't be handled by
> > the same CPU, and the irq vector spread is done in
> > irq_create_affinity_masks().
> >
> > > Try to reduce the MSI-x vectors of the megaraid_sas or mpt3sas driver
> > > via a module parameter to simulate the issue. We need more online CPUs
> > > than reply queues.
> >
> > IMO, you don't need to simulate the issue;
> > pci_alloc_irq_vectors(PCI_IRQ_AFFINITY) will handle that for you. You can
> > dump the returned irq vector count, num_possible_cpus()/num_online_cpus()
> > and each irq vector's affinity assignment.
> >
> > > We may see completions redirected to the original CPU because of
> > > "QUEUE_FLAG_SAME_FORCE", but the ISR of the low-level driver can keep
> > > one CPU busy in its local ISR routine.
> >
> > Could you dump each irq vector's affinity assignment of your megaraid in
> > your test?
> 
> To quickly reproduce, I restricted the megaraid_sas driver to a single
> MSI-x vector.  The system has 16 online CPUs in total.

I suggest you don't restrict it to a single MSI-x vector, and just use the
device's supported number of MSI-x vectors.
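
For reference, a minimal sketch of that allocation (illustrative only, with
made-up example_* names, not megaraid_sas code), which also dumps each
vector's affinity in the same "irq N, cpu list ..." form as below:

#include <linux/pci.h>

/*
 * Minimal sketch: ask the PCI core for up to max_queues MSI-X vectors and
 * let PCI_IRQ_AFFINITY spread them over the possible CPUs via
 * irq_create_affinity_masks().
 */
static int example_setup_irqs(struct pci_dev *pdev, unsigned int max_queues)
{
	int i, nr_vecs;

	nr_vecs = pci_alloc_irq_vectors(pdev, 1, max_queues,
					PCI_IRQ_MSIX | PCI_IRQ_AFFINITY);
	if (nr_vecs < 0)
		return nr_vecs;

	/* Dump each vector's Linux irq number and its affinity mask. */
	for (i = 0; i < nr_vecs; i++)
		dev_info(&pdev->dev, "irq %d, cpu list %*pbl\n",
			 pci_irq_vector(pdev, i),
			 cpumask_pr_args(pci_irq_get_affinity(pdev, i)));

	return nr_vecs;
}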

> 
> Output of affinity hints.
> kernel version:
> Linux rhel7.3 4.15.0-rc1+ #2 SMP Mon Feb 5 12:13:34 EST 2018 x86_64 x86_64
> x86_64 GNU/Linux
> PCI name is 83:00.0, dump its irq affinity:
> irq 105, cpu list 0-3,8-11

In this case, which CPU handles the interrupt is decided by the interrupt
controller, and it is easy to overload one CPU if the interrupt controller
always selects the same CPU to handle the irq.

> 
> The affinity mask is created properly, but CPU-0 alone is overloaded with
> interrupt processing.
> 
> # numactl --hardware
> available: 2 nodes (0-1)
> node 0 cpus: 0 1 2 3 8 9 10 11
> node 0 size: 47861 MB
> node 0 free: 46516 MB
> node 1 cpus: 4 5 6 7 12 13 14 15
> node 1 size: 64491 MB
> node 1 free: 62805 MB
> node distances:
> node   0   1
>   0:  10  21
>   1:  21  10
> 
> Output of system activity (sar).  (%gnice is 100% and it is consumed in
> the megaraid_sas ISR routine.)
> 
> 
> 12:44:40 PM    CPU   %usr  %nice    %sys  %iowait  %steal   %irq  %soft  %guest  %gnice  %idle
> 12:44:41 PM    all   6.03   0.00   29.98     0.00    0.00   0.00   0.00    0.00    0.00  63.99
> 12:44:41 PM      0   0.00   0.00    0.00     0.00    0.00   0.00   0.00    0.00  100.00   0.00
> 
> 
> In my test, rq_affinity was set to 2 (QUEUE_FLAG_SAME_FORCE). I also used
> the "host_tagset" V2 patch set for megaraid_sas.
> 
> Using the RFC requested in
> "https://marc.info/?l=linux-scsi&m=151601833418346&w=2", the lockup is
> avoided (you can notice that the gnice load has shifted to softirq). Even
> though it is 100% consumed, there is always an exit from the completion
> loop because of the irqpoll_weight passed to irq_poll_init().
> 
> Average:       CPU   %usr  %nice    %sys  %iowait  %steal   %irq   %soft  %guest  %gnice  %idle
> Average:       all   4.25   0.00   21.61     0.00    0.00   0.00    6.61    0.00    0.00  67.54
> Average:         0   0.00   0.00    0.00     0.00    0.00   0.00  100.00    0.00    0.00   0.00
> 
> 
> Hope this clarifies. We need a different fix to avoid lockups. Can we
> consider using the irq poll interface when the number of online CPUs is
> larger than the number of reply queues / MSI-x vectors?

Please use the device's supported number of MSI-x vectors and see if the
issue still shows up. If it does, you can use irq poll too, which doesn't
conflict with the blk-mq approach taken by this patchset.
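
In case it is useful, here is a minimal irq_poll sketch (all example_* names
are hypothetical placeholders, this is not megaraid_sas code): the hard
interrupt handler only masks the device interrupt and schedules the poll
handler, which processes at most 'budget' completions per run, so one CPU
can't be monopolized by the completion loop:

#include <linux/interrupt.h>
#include <linux/irq_poll.h>

#define EXAMPLE_IRQPOLL_WEIGHT	64	/* max completions per poll run */

/* Hypothetical device-specific helpers, assumed to exist in the driver. */
static int example_process_completions(int budget);
static void example_mask_device_intr(void);
static void example_unmask_device_intr(void);

static struct irq_poll example_iop;

/* Runs in softirq context; returns the number of completions processed. */
static int example_irqpoll_handler(struct irq_poll *iop, int budget)
{
	int done = example_process_completions(budget);

	if (done < budget) {
		/* Queue drained: stop polling and unmask the device irq. */
		irq_poll_complete(iop);
		example_unmask_device_intr();
	}
	return done;
}

static irqreturn_t example_hard_isr(int irq, void *data)
{
	/* Mask the device interrupt and defer the real work to irq_poll. */
	example_mask_device_intr();
	irq_poll_sched(&example_iop);
	return IRQ_HANDLED;
}

/* At setup time:
 *	irq_poll_init(&example_iop, EXAMPLE_IRQPOLL_WEIGHT,
 *		      example_irqpoll_handler);
 */

With something like this the ISR itself stays short, and the per-softirq
budget bounds how long any single CPU spends in completion processing.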

Hope this clarifies too, :-)


Thanks, 
Ming


