On Tue, Aug 15, 2017 at 03:48:25PM -0500, Bjorn Helgaas wrote: > On Mon, Aug 14, 2017 at 06:11:23PM -0400, Keith Busch wrote: > > On Mon, Aug 14, 2017 at 03:59:48PM -0500, Bjorn Helgaas wrote: > > > On Tue, Aug 01, 2017 at 03:11:52AM -0400, Keith Busch wrote: > > > > We've encountered a particular platform that under some circumstances > > > > always has the power fault detected status raised. The pciehp irq handler > > > > would loop forever because it thinks it is handling new events when in > > > > fact the power fault is not new. This patch fixes that by masking off > > > > the power fault status from new events if the driver hasn't seen the > > > > power fault clear from the previous handling attempt. > > > > > > Can you say which platform this is? If this is a hardware defect, > > > it'd be interesting to know where it happens. > > > > > > But I'm not sure we handle PCI_EXP_SLTSTA correctly. We basically > > > have this: > > > > > > pciehp_isr() > > > { > > > pcie_capability_read_word(pdev, PCI_EXP_SLTSTA, &status); > > > events = status & (<events we care about>); > > > pcie_capability_write_word(pdev, PCI_EXP_SLTSTA, events); > > > <queue event handling> > > > } > > > > > > The write to PCI_EXP_SLTSTA clears PCI_EXP_SLTSTA_PFD because it's > > > RW1C. But we haven't done anything that would actually change the > > > situation that caused a power fault, so I don't think it would be > > > surprising if the hardware immediately reasserted it. > > > > > > So maybe this continual assertion of power fault is really a software > > > bug, not a hardware problem? > > > > I *think* it's a software bug for the exact reason you provided, but I'm > > sure it must be isolated to certain conditions with certain hardware. We'd > > have heard about this regression during 4.9 if it was more wide-spread. > > OK, I think this makes pretty good sense. > > > > > Fixes: fad214b0aa72 ("PCI: pciehp: Process all hotplug events before > > > > looking for new ones") > > > This is on a PEX8733 bridge, and it reports power fault detected status as > > long as the power fault exists. While we can write 1 to clear the event, > > that just rearms the port to retrigger power fault detected status for as > > long as the power controller detects its faulted. The status is cleared > > for good only when the power fault no longer exists rather than when > > it is acknowledged. The spec seems to support that view (Table (7-21: > > Slot Status Register): > > > > Power Fault Detected – If a Power Controller that supports power fault > > detection is implemented, this bit is Set when the Power Controller > > detects a power fault at this slot. > > Interesting: 5651c48cfafe ("PCI pciehp: fix power fault interrupt > storm problem") turned off PCI_EXP_SLTCTL_PFDE in 2009 for basically > the same problem. > > AFAICS, we *still* never turn PCI_EXP_SLTCTL_PFDE on, so in your case, > you're probably taking an interrupt for some other reason, and > coincidentally noticing that PCI_EXP_SLTSTA_PFD is set. > > Maybe it's time to turn PCI_EXP_SLTCTL_PFDE back on along with your > fix to prevent the loop? It seems like kind of a hole that we will > never notice power faults unless some other interrupt happens. Or am > I missing something? > > I'd like to move your PCI_EXP_SLTSTA_PFD out on its own so there's > room for a comment. What do you think of the following? I really > don't think this is specific to a particular platform (other than > maybe timing-wise), so I dropped that from the changelog too. > > > commit 7612b3b28c0b900dcbcdf5e9b9747cc20a1e2455 > Author: Keith Busch <keith.busch@xxxxxxxxx> > Date: Tue Aug 1 03:11:52 2017 -0400 > > PCI: pciehp: Report power fault only once until we clear it > > When a power fault occurs, the power controller sets Power Fault Detected > in the Slot Status register, and pciehp_isr() queues an INT_POWER_FAULT > event to handle it. > > It also clears Power Fault Detected, but since nothing has yet changed to > correct the power fault, the power controller will likely set it again > immediately, which may cause an infinite loop when pcie_isr() rechecks > Slot Status. > > Fix that by masking off Power Fault Detected from new events if the driver > hasn't seen the power fault clear from the previous handling attempt. > > Fixes: fad214b0aa72 ("PCI: pciehp: Process all hotplug events before looking for new ones") > Signed-off-by: Keith Busch <keith.busch@xxxxxxxxx> > [bhelgaas: changelog, pull test out and add comment] > Signed-off-by: Bjorn Helgaas <bhelgaas@xxxxxxxxxx> > Cc: Mayurkumar Patel <mayurkumar.patel@xxxxxxxxx> > Cc: stable@xxxxxxxxxxxxxxx # 4.9+ > > diff --git a/drivers/pci/hotplug/pciehp_hpc.c b/drivers/pci/hotplug/pciehp_hpc.c > index 026830a138ae..e5d5ce9e3010 100644 > --- a/drivers/pci/hotplug/pciehp_hpc.c > +++ b/drivers/pci/hotplug/pciehp_hpc.c > @@ -586,6 +586,14 @@ static irqreturn_t pciehp_isr(int irq, void *dev_id) > events = status & (PCI_EXP_SLTSTA_ABP | PCI_EXP_SLTSTA_PFD | > PCI_EXP_SLTSTA_PDC | PCI_EXP_SLTSTA_CC | > PCI_EXP_SLTSTA_DLLSC); > + > + /* > + * If we've already reported a power fault, don't report it again > + * until we've done something to handle it. > + */ > + if (ctrl->power_fault_detected) > + events &= ~PCI_EXP_SLTSTA_PFD; > + > if (!events) > return IRQ_NONE; > This is on pci/hotplug for v4.14.