Re: PCI IRQ Affinity Infrastructure question with BLK/SCSI-MQ

Hi Himanshu,

Please set your email line width to 72 or 80 columns, otherwise it is
quite hard to reply inline.

On Fri, Dec 21, 2018 at 10:48:10PM +0000, Himanshu Madhani wrote:
> Hi Christoph,
> 
> We are facing an issue with masked MSIX vectors received while trying
> to get PCI vectors when BLK/SCSI-MQ is enabled and the number of CPUs
> is less than the number of available MSIX vectors. For our ISP25xx
> chipset, the hardware supports 32 MSIX vectors with MQ enabled. We
> originally found this issue on a system using the RH8.0 kernel, which
> is at version 4.19. The system that failed has 12 CPUs, and the
> maximum number of MSIX vectors requested was 32.

pci_alloc_irq_vectors_affinity() returns 32 in the qla2xxx driver, which
is expected behaviour, because nr_possible_cpus is actually 32 on your
system too.

> We observed that with the new pci_alloc_irq_vectors_affinity() call
> the driver is returned 32 vectors when the system has only 12 CPUs.
> As far as we understand, this call should have returned at most 14
> MSIX vectors (12 for CPU affinity + 2 reserved in .pre_vectors of
> struct irq_affinity). Also, we see that the vectors returned include
> masked ones. Since the driver received 32 vectors, we create 30
> qpairs (2 less for the reserved ones). In this scenario, we observed
> that on some qpairs the driver is not able to process interrupts
> because the CPUs are masked at the PCI layer. Looking at the code, we
> noticed that the 'pre/post' vector sets in struct irq_affinity don't
> appear to help here.
>

Yes, .pre_vectors is 2, which means 30 PCI_IRQ_AFFINITY IO vectors are
returned, and that is still correct behaviour; the result isn't
guaranteed to be 32 because the system may run out of irq vectors.

Especially in this case, you only have 12 online CPUs, so one vector
is enough to handle the IO submitted from one CPU.
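
To make that concrete, the allocation side looks roughly like the
sketch below. This is not the exact qla2xxx code; the function name,
'pdev' and the min/max values are just for illustration:

#include <linux/interrupt.h>
#include <linux/pci.h>

static int example_alloc_vectors(struct pci_dev *pdev)
{
        struct irq_affinity desc = {
                .pre_vectors = 2,  /* the two reserved vectors above */
        };
        int nvecs;

        /* ask for 3..32 MSIX vectors, spread the IO ones across CPUs */
        nvecs = pci_alloc_irq_vectors_affinity(pdev, 3, 32,
                        PCI_IRQ_MSIX | PCI_IRQ_AFFINITY, &desc);

        /*
         * nvecs can be anything between min_vecs and max_vecs; with 32
         * possible CPUs and .pre_vectors == 2, getting 32 back is
         * correct: 30 IO vectors spread over the possible CPUs plus
         * the 2 reserved ones.
         */
        return nvecs;
}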


> From the call below we expected to get back only num_online_cpus()
> plus the reserved vectors when requesting vectors; instead we get
> back the number the driver requested.
> 

No, that isn't correct. In theory it is fine for
pci_alloc_irq_vectors_affinity() to return any number of irq vectors,
depending on how many irq vectors are available. In particular it is
workable to return just the reserved vectors (.pre_vectors plus
.post_vectors) and >= 1 PCI_IRQ_AFFINITY IO vector.
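
That also means the driver has to size its queues from whatever count
comes back, not from what it asked for. Roughly (again only a sketch;
the helper name and the assumption that the driver sets
shost->nr_hw_queues directly are illustrative):

#include <scsi/scsi_host.h>

/*
 * Sketch only: derive the number of blk-mq hardware queues from what
 * pci_alloc_irq_vectors_affinity() actually returned.
 */
static void example_set_nr_hw_queues(struct Scsi_Host *shost,
                                     int nvecs, int pre_vectors)
{
        /* IO vectors actually usable for qpairs / hw queues */
        int io_vecs = nvecs - pre_vectors;   /* 30 in your case */

        /* one blk-mq hardware queue per IO vector, no more */
        shost->nr_hw_queues = io_vecs;
}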

> int pci_alloc_irq_vectors_affinity(struct pci_dev *dev, unsigned int min_vecs,
>                                    unsigned int max_vecs, unsigned int flags,
>                                    const struct irq_affinity *affd)
> {
>         if (flags & PCI_IRQ_MSIX) {
>                 vecs = __pci_enable_msix_range(dev, NULL, min_vecs, max_vecs,
>                                 affd);
>                 if (vecs > 0)
>                         return vecs;
>         }
> }
>  
> static int __pci_enable_msix_range(struct pci_dev *dev,
>                                    struct msix_entry *entries, int minvec,
>                                    int maxvec, const struct irq_affinity *affd)
> {
>         for (;;) {
>                 if (affd) {
>                         nvec = irq_calc_affinity_vectors(minvec, nvec, affd);
>                         if (nvec < minvec)
>                                 return -ENOSPC;
>                 }
> }
> 
> This in turn calls irq_calc_affinity_vectors(), which we expected to
> return at most num_online_cpus() + resv:
>  
> /**
> * irq_calc_affinity_vectors - Calculate the optimal number of vectors
> * @minvec:     The minimum number of vectors available
> * @maxvec:     The maximum number of vectors available
> * @affd:       Description of the affinity requirements
> */
> int irq_calc_affinity_vectors(int minvec, int maxvec, const struct irq_affinity *affd)
> {
>         int resv = affd->pre_vectors + affd->post_vectors;
>         int vecs = maxvec - resv;
>         int ret;
>         if (resv > minvec)
>                 return 0;
>         get_online_cpus();
>         ret = min_t(int, cpumask_weight(cpu_possible_mask), vecs) + resv;
>         put_online_cpus();
>         return ret;
> }
> 
> We see the same behaviour using the 4.20.0-rc6 kernel. See the table
> below; we experimented by forcing the maxcpus= kernel parameter to
> expose a lower number of CPUs than the number of vectors requested.
> 
> Upstream - 4.20-rc6         
>  
>                               MaxCPU=     Cores       Result    
>       MQ Enabled  ISP25xx     Unset       48          Pass
>       MQ Enabled  ISP25xx     2           24          Failed
>       MQ Enabled  ISP25xx     4           30          Failed
>       MQ Enabled  ISP27xx     Unset       48          Pass
>       MQ Enabled  ISP27xx     2           24          Failed
>       MQ Enabled  ISP27xx     4           30          Failed
>       
> Note that the RH8.0 kernel, which has the code from the 4.19 kernel,
> behaves the same way. We have not been able to do extensive testing
> with SLES.
> We want to make sure we are reading this code right and that our
> understanding is correct. If not, please advise the right
> expectations and what changes are needed to address this.
> 
> In case our understanding is right, is there any known issue in this
> area in the 4.19 kernel which got addressed in the 4.20-rc6 kernel?
> If yes, can you please point us to the commit. If not, what
> additional data is needed to debug this further? We have captured a
> PCIe trace and ruled out any issues at the hardware/firmware level,
> and we also see that the MSIX vector associated with the queue pair
> where we are not getting interrupts is masked.
> 
> We want to understand how to calculate the number of IRQ vectors the
> driver can request in such a scenario.

The irq vector allocation isn't wrong, and your IO hang is probably
caused by not using the correct msix vector (qpair/hardware queue).

For example, with 30 IO vectors returned, the mapping between CPU and
IO vector may look like the listing below; you have to double-check
whether the correct msix vector is used.

CPUs 0 ~ 11 are online, so only irqs 45~55 & 57 should be used. You
can see which CPU each request originated from via rq->mq_ctx->cpu,
and the CPU -> hw queue mapping is done by blk-mq automatically. In
particular, blk_mq_unique_tag_to_hwq(tag) tells you which hardware
queue the request is mapped to, so you can figure out which msix
vector should be used for that hardware queue (see the sketch after
the irq list below).

	irq 45, cpu list 0
	irq 46, cpu list 1
	irq 47, cpu list 2
	irq 48, cpu list 3
	irq 49, cpu list 4
	irq 50, cpu list 5
	irq 51, cpu list 6
	irq 52, cpu list 7
	irq 53, cpu list 8
	irq 54, cpu list 9
	irq 55, cpu list 10
	irq 57, cpu list 11
	irq 58, cpu list 12-13
	irq 59, cpu list 14-15
	irq 60, cpu list 16
	irq 61, cpu list 17
	irq 62, cpu list 18
	irq 63, cpu list 19
	irq 64, cpu list 20
	irq 65, cpu list 21
	irq 66, cpu list 22
	irq 67, cpu list 23
	irq 68, cpu list 24
	irq 69, cpu list 25
	irq 70, cpu list 26
	irq 71, cpu list 27
	irq 72, cpu list 28
	irq 73, cpu list 29
	irq 74, cpu list 30
	irq 75, cpu list 31
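
If you want to verify that from inside the driver, a debugging hack
like the one below (just a sketch against the 4.19/4.20-era
scsi_cmnd->request field, not actual qla2xxx code) in the queuecommand
path should show whether the qpair you post the command to matches the
blk-mq hardware queue:

#include <linux/blk-mq.h>
#include <linux/smp.h>
#include <scsi/scsi_cmnd.h>

/*
 * Debugging sketch only: log which blk-mq hardware queue a command is
 * mapped to, so it can be compared with the qpair/msix vector the
 * driver actually posts it to.
 */
static void dbg_check_mapping(struct scsi_cmnd *cmd)
{
        u32 tag = blk_mq_unique_tag(cmd->request);
        u16 hwq = blk_mq_unique_tag_to_hwq(tag);

        pr_info("cmd submitted on CPU %d -> blk-mq hw queue %u\n",
                raw_smp_processor_id(), hwq);
        /*
         * The qpair (and so the msix vector) this command is posted
         * to has to be the one set up for 'hwq'; with .pre_vectors
         * == 2 that is typically msix entry hwq + 2.
         */
}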


thanks,
Ming


