On Tue, 13 Nov 2018 00:41:24 +0000, Gustavo Pimentel <gustavo.pimentel@xxxxxxxxxxxx> wrote: > > On 09/11/2018 11:34, Marc Zyngier wrote: > > On 08/11/18 19:49, Trent Piepho wrote: > >> On Thu, 2018-11-08 at 09:49 +0000, Marc Zyngier wrote: > >>> On 07/11/18 20:17, Trent Piepho wrote: > >>>> On Wed, 2018-11-07 at 18:41 +0000, Marc Zyngier wrote: > >>>>> On 06/11/18 19:40, Trent Piepho wrote: > >>>>>> > >>>>>> What about stable kernels that don't have the hierarchical API? > >>>>> > >>>>> My goal is to fix mainline first. Once we have something that works on > >>>>> mainline, we can look at propagating the fix to other versions. But > >>>>> mainline always comes first. > >>>> > >>>> This is a regression that went into 4.14. Wouldn't the appropriate > >>>> action for the stable series be to undo the regression? > >>> > >>> This is not how stable works. Stable kernels *only* contain patches that > >>> are backported from mainline, and do not take standalone patch. > >>> > >>> Furthermore, your fix is to actually undo someone else's fix. Who is > >>> right? In the absence of any documentation, the answer is "nobody". > >> > >> Little more history to this bug. The code was originally the way it is > >> now, but this same bug was fixed in 2013 in https://urldefense.proofpoint.com/v2/url?u=https-3A__patchwork.kernel.or&d=DwICaQ&c=DPL6_X_6JkXFx7AXWqB0tg&r=bkWxpLoW-f-E3EdiDCCa0_h0PicsViasSlvIpzZvPxs&m=Tk2dtPxJ5WvHjQbWWx35od-JZmur8mRuDrSSVhf8nYs&s=3nFZMUc2qvOXJht1WuNCDorudHeFiWyeDaO1UbhTXmw&e= > >> g/patch/3333681/ > >> > >> Then that lasted four years until it was changed Aug 2017 in https://urldefense.proofpoint.com/v2/url?u=https-3A__pa&d=DwICaQ&c=DPL6_X_6JkXFx7AXWqB0tg&r=bkWxpLoW-f-E3EdiDCCa0_h0PicsViasSlvIpzZvPxs&m=Tk2dtPxJ5WvHjQbWWx35od-JZmur8mRuDrSSVhf8nYs&s=yfWJAoFIGjLbTnuE_KLzhW7J_pyMD3X1dXm4eHoy7_o&e= > >> tchwork.kernel.org/patch/9893303/ > >> > >> That lasted just six months until someone tried to revert it, https://urldefense.proofpoint.com/v2/url?u=https-3A__p&d=DwICaQ&c=DPL6_X_6JkXFx7AXWqB0tg&r=bkWxpLoW-f-E3EdiDCCa0_h0PicsViasSlvIpzZvPxs&m=Tk2dtPxJ5WvHjQbWWx35od-JZmur8mRuDrSSVhf8nYs&s=q1LGEb9w_CjcwmWVFOM5VWJ17oS9XHd8SsWTsBF8zfU&e= > >> atchwork.kernel.org/patch/9893303/ > >> > >> Seems pretty clear the way it is now is much worse than the way it was > >> before, even if the previous design may have had another flaw. Though > >> I've yet to see anyone point out something makes the previous design > >> broken. Sub-optimal yes, but not broken. > > > > This is not what was reported by the previous submitter. I guess they > > changed this for a reason, no? I'm prepared to admit this is a end-point > > driver bug, but we need to understand what is happening and stop > > changing this driver randomly. > > > >>> Anything can be backported to stable once we understand the issue. At > >>> the moment, we're just playing games moving stuff around and hope > >>> nothing else will break. That's not a sustainable way of maintaining > >>> this driver. At the moment, the only patch I'm inclined to propose until > >>> we get an actual interrupt handling flow from Synopsys is to mark this > >>> driver as "BROKEN". > >> > >> It feels like you're using this bug to hold designware hostage in a > >> broken kernel, and me along with them. I don't have the documentation, > >> no one does, there's no way for me to give you want you want. But I've > >> got hardware that doesn't work in the mainline kernel. > > > > I take it as a personal offence that I'd be holding anything or anyone > > hostage. I think I have a long enough track record working with the > > Linux kernel not to take any of this nonsense. What's my interest in > > keeping anything in this sorry state? Think about it for a minute. > > > > When I'm asking for documentation, I'm not asking you. I perfectly > > understood that you don't have any. We need Synopsys to step up and give > > us a simple description of what the MSI workflow is. Because no matter > > how you randomly mutate this code, it will still be broken until we get > > a clear answer from *Synopsys*. > > > > Gustavo, here's one simple ask. Please reply to this email with a step > > by step, register by register description of how an MSI must be handled > > on this HW. We do need it, and we need it now. > > Hi Marc, I thought that I did respond to your question on Nov 8. However I will > repeat it and add some extra information to it now. > > About the registers: > > PCIE_MSI_INTR0_MASK > When an MSI is received for a masked interrupt, the corresponding status bit > gets set in the interrupt status register but the controller will not signal it. > As soon as the masked interrupt is unmasked and assuming the status bit is still > set the controller will signal it. So this is *really* a mask. > PCIE_MSI_INTR0_ENABLE > Enables/disables every interrupt. When an MSI is received from a disabled > interrupt, no status bit gets set in MSI controller interrupt status register. And that's not a mask at all. This allows the interrupt to be simply lost. It is just a bit worrying that this is exactly what the DWC driver is using, and faces all kind of interrupt loss when enabling/disabling interrupts (which do have a mask/unmask semantic -- yes, the language is confusing). > > About this new callbacks: > > The dw_pci_bottom_mask() and dw_pci_bottom_unmask() callbacks were added on > Linux kernel v4.17 on PCI: dwc: Move MSI IRQs allocation to IRQ domains > hierarchical API patch [1] and based on what I've seen so far, before this patch > there were no interrupt masking been performed. > > Based on the information provided, its very likely that I should use the > PCIE_MSI_INTR0_MASK register instead of PCIE_MSI_INTR0_ENABLE on the > dw_pci_bottom_mask() and dw_pci_bottom_unmask() callbacks. As soon as I return > from Linux plumbers conference, I will test the system using the > PCIE_MSI_INTR0_MASK register. I'm at Plumbers as well, so feel free to grab me and we'll sort this out. We've wasted enough time on this broken driver. Thanks, M. -- Jazz is not dead, it just smell funny.