Re: [PATCH] PCI: dwc: Fix interrupt race in when handling MSI

Marc Zyngier <marc.zyngier@xxxxxxx> · Tue, 13 Nov 2018 14:40:39 +0000

On Tue, 13 Nov 2018 00:41:24 +0000,
Gustavo Pimentel <gustavo.pimentel@xxxxxxxxxxxx> wrote:
> 
> On 09/11/2018 11:34, Marc Zyngier wrote:
> > On 08/11/18 19:49, Trent Piepho wrote:
> >> On Thu, 2018-11-08 at 09:49 +0000, Marc Zyngier wrote:
> >>> On 07/11/18 20:17, Trent Piepho wrote:
> >>>> On Wed, 2018-11-07 at 18:41 +0000, Marc Zyngier wrote:
> >>>>> On 06/11/18 19:40, Trent Piepho wrote:
> >>>>>>
> >>>>>> What about stable kernels that don't have the hierarchical API?
> >>>>>
> >>>>> My goal is to fix mainline first. Once we have something that works on
> >>>>> mainline, we can look at propagating the fix to other versions. But
> >>>>> mainline always comes first.
> >>>>
> >>>> This is a regression that went into 4.14.  Wouldn't the appropriate
> >>>> action for the stable series be to undo the regression?
> >>>
> >>> This is not how stable works. Stable kernels *only* contain patches that
> >>> are backported from mainline, and do not take standalone patch.
> >>>
> >>> Furthermore, your fix is to actually undo someone else's fix. Who is
> >>> right? In the absence of any documentation, the answer is "nobody".
> >>
> >> Little more history to this bug.  The code was originally the way it is
> >> now, but this same bug was fixed in 2013 in https://urldefense.proofpoint.com/v2/url?u=https-3A__patchwork.kernel.or&d=DwICaQ&c=DPL6_X_6JkXFx7AXWqB0tg&r=bkWxpLoW-f-E3EdiDCCa0_h0PicsViasSlvIpzZvPxs&m=Tk2dtPxJ5WvHjQbWWx35od-JZmur8mRuDrSSVhf8nYs&s=3nFZMUc2qvOXJht1WuNCDorudHeFiWyeDaO1UbhTXmw&e=
> >> g/patch/3333681/
> >>
> >> Then that lasted four years until it was changed Aug 2017 in https://urldefense.proofpoint.com/v2/url?u=https-3A__pa&d=DwICaQ&c=DPL6_X_6JkXFx7AXWqB0tg&r=bkWxpLoW-f-E3EdiDCCa0_h0PicsViasSlvIpzZvPxs&m=Tk2dtPxJ5WvHjQbWWx35od-JZmur8mRuDrSSVhf8nYs&s=yfWJAoFIGjLbTnuE_KLzhW7J_pyMD3X1dXm4eHoy7_o&e=
> >> tchwork.kernel.org/patch/9893303/
> >>
> >> That lasted just six months until someone tried to revert it, https://urldefense.proofpoint.com/v2/url?u=https-3A__p&d=DwICaQ&c=DPL6_X_6JkXFx7AXWqB0tg&r=bkWxpLoW-f-E3EdiDCCa0_h0PicsViasSlvIpzZvPxs&m=Tk2dtPxJ5WvHjQbWWx35od-JZmur8mRuDrSSVhf8nYs&s=q1LGEb9w_CjcwmWVFOM5VWJ17oS9XHd8SsWTsBF8zfU&e=
> >> atchwork.kernel.org/patch/9893303/
> >>
> >> Seems pretty clear the way it is now is much worse than the way it was
> >> before, even if the previous design may have had another flaw.  Though
> >> I've yet to see anyone point out something makes the previous design
> >> broken.  Sub-optimal yes, but not broken.
> > 
> > This is not what was reported by the previous submitter. I guess they
> > changed this for a reason, no? I'm prepared to admit this is a end-point
> > driver bug, but we need to understand what is happening and stop
> > changing this driver randomly.
> > 
> >>> Anything can be backported to stable once we understand the issue. At
> >>> the moment, we're just playing games moving stuff around and hope
> >>> nothing else will break. That's not a sustainable way of maintaining
> >>> this driver. At the moment, the only patch I'm inclined to propose until
> >>> we get an actual interrupt handling flow from Synopsys is to mark this
> >>> driver as "BROKEN".
> >>
> >> It feels like you're using this bug to hold designware hostage in a
> >> broken kernel, and me along with them.  I don't have the documentation,
> >> no one does, there's no way for me to give you want you want.  But I've
> >> got hardware that doesn't work in the mainline kernel.
> > 
> > I take it as a personal offence that I'd be holding anything or anyone
> > hostage. I think I have a long enough track record working with the
> > Linux kernel not to take any of this nonsense. What's my interest in
> > keeping anything in this sorry state? Think about it for a minute.
> > 
> > When I'm asking for documentation, I'm not asking you. I perfectly
> > understood that you don't have any. We need Synopsys to step up and give
> > us a simple description of what the MSI workflow is. Because no matter
> > how you randomly mutate this code, it will still be broken until we get
> > a clear answer from *Synopsys*.
> > 
> > Gustavo, here's one simple ask. Please reply to this email with a step
> > by step, register by register description of how an MSI must be handled
> > on this HW. We do need it, and we need it now.
> 
> Hi Marc, I thought that I did respond to your question on Nov 8. However I will
> repeat it and add some extra information to it now.
> 
> About the registers:
> 
> PCIE_MSI_INTR0_MASK
> When an MSI is received for a masked interrupt, the corresponding status bit
> gets set in the interrupt status register but the controller will not signal it.
> As soon as the masked interrupt is unmasked and assuming the status bit is still
> set the controller will signal it.

So this is *really* a mask.

> PCIE_MSI_INTR0_ENABLE
> Enables/disables every interrupt. When an MSI is received from a disabled
> interrupt, no status bit gets set in MSI controller interrupt status register.

And that's not a mask at all. This allows the interrupt to be simply
lost. It is just a bit worrying that this is exactly what the DWC
driver is using, and faces all kind of interrupt loss when
enabling/disabling interrupts (which do have a mask/unmask semantic --
yes, the language is confusing).

> 
> About this new callbacks:
> 
> The dw_pci_bottom_mask() and dw_pci_bottom_unmask() callbacks were added on
> Linux kernel v4.17 on PCI: dwc: Move MSI IRQs allocation to IRQ domains
> hierarchical API patch [1] and based on what I've seen so far, before this patch
> there were no interrupt masking been performed.
> 
> Based on the information provided, its very likely that I should use the
> PCIE_MSI_INTR0_MASK register instead of PCIE_MSI_INTR0_ENABLE on the
> dw_pci_bottom_mask() and dw_pci_bottom_unmask() callbacks. As soon as I return
> from Linux plumbers conference, I will test the system using the
> PCIE_MSI_INTR0_MASK register.

I'm at Plumbers as well, so feel free to grab me and we'll sort this
out. We've wasted enough time on this broken driver.

Thanks,

	M.

-- 
Jazz is not dead, it just smell funny.