PCI IRQ Affinity Infrastructure question with BLK/SCSI-MQ

Hi Christoph,

We are facing an issue with masked MSIX vectors when allocating PCI IRQ vectors with BLK/SCSI-MQ enabled and the system has fewer CPUs than the MSIX vectors the hardware supports. For our ISP25xx chipset, the hardware supports 32 MSIX vectors with MQ enabled. We originally found this issue on a system running the RHEL 8.0 kernel, which is based on 4.19. The system that failed has 12 CPUs, and the driver requested a maximum of 32 MSIX vectors.
We observed that the pci_alloc_irq_vectors_affinity() interface returns 32 vectors even though the system has only 12 CPUs. As far as we understand, this call should have returned at most 14 MSIX vectors (12 for CPU affinity plus the 2 reserved via .pre_vectors in struct irq_affinity). We also see that the returned vectors include masked ones. Since the driver received 32 vectors, we create 30 qpairs (two fewer, to account for the reserved vectors). In this scenario, some qpairs cannot process interrupts because the corresponding vectors are masked at the PCI layer. Looking at the code, the pre/post vector sets in struct irq_affinity do not appear to help here.
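
For reference, this is roughly how we invoke the interface from the driver (a minimal sketch only; the minimum vector count and the max_hw_vectors/num_qpairs names are illustrative, not the exact qla2xxx code):

        struct irq_affinity desc = {
                .pre_vectors = 2,       /* reserved, non-queue interrupts */
        };
        int vectors;

        /* ask for up to the 32 vectors the hardware supports */
        vectors = pci_alloc_irq_vectors_affinity(pdev, desc.pre_vectors + 1,
                        max_hw_vectors, PCI_IRQ_MSIX | PCI_IRQ_AFFINITY,
                        &desc);
        if (vectors < 0)
                return vectors;

        /* queue pairs are derived from whatever count came back */
        num_qpairs = vectors - desc.pre_vectors;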

From the call below, we expected to get back only num_online_cpus() + the reserved vectors, but instead we get back the number of vectors the driver requested.
 
int pci_alloc_irq_vectors_affinity(struct pci_dev *dev, unsigned int min_vecs,
                                   unsigned int max_vecs, unsigned int flags,
                                   const struct irq_affinity *affd)
{
        int vecs = -ENOSPC;

        /* ... affinity/flag sanity checks trimmed ... */

        if (flags & PCI_IRQ_MSIX) {
                vecs = __pci_enable_msix_range(dev, NULL, min_vecs, max_vecs,
                                affd);
                if (vecs > 0)
                        return vecs;
        }

        /* ... MSI and legacy IRQ fallback trimmed ... */
}
 
static int __pci_enable_msix_range(struct pci_dev *dev,
                                   struct msix_entry *entries, int minvec,
                                   int maxvec, const struct irq_affinity *affd)
{
        int nvec = maxvec;

        /* ... range/state checks trimmed ... */

        for (;;) {
                if (affd) {
                        nvec = irq_calc_affinity_vectors(minvec, nvec, affd);
                        if (nvec < minvec)
                                return -ENOSPC;
                }

                /* ... __pci_enable_msix() retry loop trimmed ... */
        }
}

This in turn calls irq_calc_affinity_vectors(). Since nvec starts out at maxvec, the cap on the requested count comes from this function, which we expected to limit the result to num_online_cpus() + resv:
 
/**
 * irq_calc_affinity_vectors - Calculate the optimal number of vectors
 * @minvec:     The minimum number of vectors available
 * @maxvec:     The maximum number of vectors available
 * @affd:       Description of the affinity requirements
 */
int irq_calc_affinity_vectors(int minvec, int maxvec, const struct irq_affinity *affd)
{
        int resv = affd->pre_vectors + affd->post_vectors;
        int vecs = maxvec - resv;
        int ret;

        if (resv > minvec)
                return 0;

        get_online_cpus();
        ret = min_t(int, cpumask_weight(cpu_possible_mask), vecs) + resv;
        put_online_cpus();
        return ret;
}
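
Plugging our numbers into this function (illustrative arithmetic, assuming cpumask_weight(cpu_possible_mask) matched the 12 online CPUs):

        /* maxvec = 32, pre_vectors = 2, post_vectors = 0 */
        resv = 2 + 0;                   /* 2               */
        vecs = 32 - 2;                  /* 30              */
        ret  = min(12, 30) + 2;         /* 14, as expected */

Getting 32 back instead suggests that cpumask_weight(cpu_possible_mask) is at least 30 on this system, i.e. larger than the 12 online CPUs we expected the calculation to be based on.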

We see the same behavior with the 4.20.0-rc6 kernel. In the table below, we experimented by forcing the maxcpus= kernel parameter to expose fewer CPUs than the number of vectors requested.

Upstream 4.20-rc6:

      Mode        Chipset     maxcpus=    Cores       Result
      MQ Enabled  ISP25xx     Unset       48          Pass
      MQ Enabled  ISP25xx     2           24          Failed
      MQ Enabled  ISP25xx     4           30          Failed
      MQ Enabled  ISP27xx     Unset       48          Pass
      MQ Enabled  ISP27xx     2           24          Failed
      MQ Enabled  ISP27xx     4           30          Failed

Note that the RHEL 8.0 kernel, which carries the 4.19 code, behaves the same way. We have not been able to do extensive testing with SLES.
We want to make sure we are reading this code correctly and that our understanding is right. If not, please advise what the correct expectation is and what changes are needed to address this.

If our understanding is right, is there a known issue in this area in the 4.19 kernel that was addressed by 4.20-rc6? If so, could you please point us to the relevant commit? If not, what additional data is needed to debug this further? We have captured a PCIe trace and ruled out any issues at the hardware/firmware level, and we can see that the MSIX vector associated with the queue pair that is not receiving interrupts is masked.

We want to understand how the driver should calculate the number of IRQ vectors to request in such a scenario.
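
To make the question concrete, this is the cap we assumed would effectively apply to what the driver gets back (a sketch of our expectation only, not a proposed change; max_hw_vectors is illustrative):

        /* queue vectors limited by online CPUs, plus the reserved ones */
        expected = min_t(int, num_online_cpus(), max_hw_vectors - 2) + 2;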

Thanks,
Himanshu



