Re: [PATCH] PCI: dwc: fix MSI IRQ handler ordering

Stephen Warren <swarren@xxxxxxxxxxxxx> · Mon, 7 Jan 2019 10:16:29 -0700

On 1/7/19 2:36 AM, Gustavo Pimentel wrote:
Hi Stephen,

On 04/01/2019 21:45, Stephen Warren wrote:
From: Stephen Warren <swarren@xxxxxxxxxx>

The current code does this when handling MSI IRQs:

a) Process the irq.
b) Clear the latched IRQ status.

If a new IRQ occurs any time after (a) has read the IRQ status for the
last time and before (b), it will be lost. For example, this occurs in
practice when using a Marvell 9171 AHCI controller with NCQ enabled;
many command timeouts occur with certain disk access patterns.

Fix the code to do the following instead, so that if any new IRQs are
raised during the processing of the IRQ, the IRQ status is not cleared,
so that the IRQ is not lost.

a) Clear the latched IRQ status.
b) Process the IRQ.

This change reverts commit 8c934095fa2f ("PCI: dwc: Clear MSI interrupt
status after it is handled, not before")

This change re-applies commit ca1658921b63 ("PCI: designware: Fix
missing MSI IRQs")

Signed-off-by: Stephen Warren <swarren@xxxxxxxxxx>
Cc: Arnd Bergmann <arnd@xxxxxxxx>
Cc: Bjorn Helgaas <bhelgaas@xxxxxxxxxx>
Cc: Faiz Abbas <faiz_abbas@xxxxxx>
Cc: Harro Haan <hrhaan@xxxxxxxxx>
Cc: Jingoo Han <jg1.han@xxxxxxxxxxx>
Cc: Joao Pinto <jpinto@xxxxxxxxxxxx>
Cc: Juergen Beisert <jbe@xxxxxxxxxxxxxx>
Cc: Marek Vasut <marex@xxxxxxx>
Cc: Matthias Mann <m.mann@xxxxxxxxxxxxxxxxxxxxxx>
Cc: Mohit Kumar <mohit.kumar@xxxxxx>
Cc: Pratyush Anand <pratyush.anand@xxxxxx>
Cc: Richard Zhu <hong-xing.zhu@xxxxxxxxxxxxx>
Cc: Sean Cross <xobs@xxxxxxxxxx>
Cc: Shawn Guo <shawn.guo@xxxxxxxxxx>
Cc: Siva Reddy Kallam <siva.kallam@xxxxxxxxxxx>
Cc: Srikanth T Shivanand <ts.srikanth@xxxxxxxxxxx>
Cc: Tim Harvey <tharvey@xxxxxxxxxxxxx>
---
Note: This issue was found in downstream NVIDIA 4.9 and 4.14 kernels.
However, the exact same code structure is present in mainline and I have
no reason to believe the problem would not reproduce there. I have
compile tested but not runtime tested it in mainline, since my board is
not yet supported in mainline.
---
  drivers/pci/controller/dwc/pcie-designware-host.c | 2 +-
  1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/pci/controller/dwc/pcie-designware-host.c b/drivers/pci/controller/dwc/pcie-designware-host.c
index 692dd1b264fb..7fd6c56a6f35 100644
--- a/drivers/pci/controller/dwc/pcie-designware-host.c
+++ b/drivers/pci/controller/dwc/pcie-designware-host.c
@@ -98,10 +98,10 @@ irqreturn_t dw_handle_msi_irq(struct pcie_port *pp)
  			irq = irq_find_mapping(pp->irq_domain,
  					       (i * MAX_MSI_IRQS_PER_CTRL) +
  					       pos);
-			generic_handle_irq(irq);
  			dw_pcie_wr_own_conf(pp, PCIE_MSI_INTR0_STATUS +
  						(i * MSI_REG_CTRL_BLOCK_SIZE),
  					    4, 1 << pos);
+			generic_handle_irq(irq);
  			pos++;
  		}
  	}

This fix was already suggested by Trent Piepho, however, after review by Marc
Zyngier and some information obtained from Synopsys IP development team, the
following set of patches was generated to correct the problem.

830920e065e9 ("PCI: dwc: Use interrupt masking instead of disabling")
fce5423e4f43 ("PCI: dwc: Take lock when ACKing an interrupt")
3f7bb2ec20ce ("PCI: dwc: Move interrupt acking into the proper callback")

OK. I assume that dw_pci_bottom_ack() gets called before 
dw_handle_msi_irq() for each IRQ? In which case, the last of those 3 
changes is essentially identical to the patch I proposed, it's just that 
the acking is handled in a separate function rather than earlier in the 
same function.