Re: [PATCH] PCI: dwc: Fix interrupt race in when handling MSI

Gustavo Pimentel <gustavo.pimentel@xxxxxxxxxxxx> · Tue, 13 Nov 2018 00:41:24 +0000

On 09/11/2018 11:34, Marc Zyngier wrote:
> On 08/11/18 19:49, Trent Piepho wrote:
>> On Thu, 2018-11-08 at 09:49 +0000, Marc Zyngier wrote:
>>> On 07/11/18 20:17, Trent Piepho wrote:
>>>> On Wed, 2018-11-07 at 18:41 +0000, Marc Zyngier wrote:
>>>>> On 06/11/18 19:40, Trent Piepho wrote:
>>>>>>
>>>>>> What about stable kernels that don't have the hierarchical API?
>>>>>
>>>>> My goal is to fix mainline first. Once we have something that works on
>>>>> mainline, we can look at propagating the fix to other versions. But
>>>>> mainline always comes first.
>>>>
>>>> This is a regression that went into 4.14.  Wouldn't the appropriate
>>>> action for the stable series be to undo the regression?
>>>
>>> This is not how stable works. Stable kernels *only* contain patches that
>>> are backported from mainline, and do not take standalone patch.
>>>
>>> Furthermore, your fix is to actually undo someone else's fix. Who is
>>> right? In the absence of any documentation, the answer is "nobody".
>>
>> Little more history to this bug.  The code was originally the way it is
>> now, but this same bug was fixed in 2013 in https://urldefense.proofpoint.com/v2/url?u=https-3A__patchwork.kernel.or&d=DwICaQ&c=DPL6_X_6JkXFx7AXWqB0tg&r=bkWxpLoW-f-E3EdiDCCa0_h0PicsViasSlvIpzZvPxs&m=Tk2dtPxJ5WvHjQbWWx35od-JZmur8mRuDrSSVhf8nYs&s=3nFZMUc2qvOXJht1WuNCDorudHeFiWyeDaO1UbhTXmw&e=
>> g/patch/3333681/
>>
>> Then that lasted four years until it was changed Aug 2017 in https://urldefense.proofpoint.com/v2/url?u=https-3A__pa&d=DwICaQ&c=DPL6_X_6JkXFx7AXWqB0tg&r=bkWxpLoW-f-E3EdiDCCa0_h0PicsViasSlvIpzZvPxs&m=Tk2dtPxJ5WvHjQbWWx35od-JZmur8mRuDrSSVhf8nYs&s=yfWJAoFIGjLbTnuE_KLzhW7J_pyMD3X1dXm4eHoy7_o&e=
>> tchwork.kernel.org/patch/9893303/
>>
>> That lasted just six months until someone tried to revert it, https://urldefense.proofpoint.com/v2/url?u=https-3A__p&d=DwICaQ&c=DPL6_X_6JkXFx7AXWqB0tg&r=bkWxpLoW-f-E3EdiDCCa0_h0PicsViasSlvIpzZvPxs&m=Tk2dtPxJ5WvHjQbWWx35od-JZmur8mRuDrSSVhf8nYs&s=q1LGEb9w_CjcwmWVFOM5VWJ17oS9XHd8SsWTsBF8zfU&e=
>> atchwork.kernel.org/patch/9893303/
>>
>> Seems pretty clear the way it is now is much worse than the way it was
>> before, even if the previous design may have had another flaw.  Though
>> I've yet to see anyone point out something makes the previous design
>> broken.  Sub-optimal yes, but not broken.
> 
> This is not what was reported by the previous submitter. I guess they
> changed this for a reason, no? I'm prepared to admit this is a end-point
> driver bug, but we need to understand what is happening and stop
> changing this driver randomly.
> 
>>> Anything can be backported to stable once we understand the issue. At
>>> the moment, we're just playing games moving stuff around and hope
>>> nothing else will break. That's not a sustainable way of maintaining
>>> this driver. At the moment, the only patch I'm inclined to propose until
>>> we get an actual interrupt handling flow from Synopsys is to mark this
>>> driver as "BROKEN".
>>
>> It feels like you're using this bug to hold designware hostage in a
>> broken kernel, and me along with them.  I don't have the documentation,
>> no one does, there's no way for me to give you want you want.  But I've
>> got hardware that doesn't work in the mainline kernel.
> 
> I take it as a personal offence that I'd be holding anything or anyone
> hostage. I think I have a long enough track record working with the
> Linux kernel not to take any of this nonsense. What's my interest in
> keeping anything in this sorry state? Think about it for a minute.
> 
> When I'm asking for documentation, I'm not asking you. I perfectly
> understood that you don't have any. We need Synopsys to step up and give
> us a simple description of what the MSI workflow is. Because no matter
> how you randomly mutate this code, it will still be broken until we get
> a clear answer from *Synopsys*.
> 
> Gustavo, here's one simple ask. Please reply to this email with a step
> by step, register by register description of how an MSI must be handled
> on this HW. We do need it, and we need it now.

Hi Marc, I thought that I did respond to your question on Nov 8. However I will
repeat it and add some extra information to it now.

About the registers:

PCIE_MSI_INTR0_MASK
When an MSI is received for a masked interrupt, the corresponding status bit
gets set in the interrupt status register but the controller will not signal it.
As soon as the masked interrupt is unmasked and assuming the status bit is still
set the controller will signal it.

PCIE_MSI_INTR0_ENABLE
Enables/disables every interrupt. When an MSI is received from a disabled
interrupt, no status bit gets set in MSI controller interrupt status register.

About this new callbacks:

The dw_pci_bottom_mask() and dw_pci_bottom_unmask() callbacks were added on
Linux kernel v4.17 on PCI: dwc: Move MSI IRQs allocation to IRQ domains
hierarchical API patch [1] and based on what I've seen so far, before this patch
there were no interrupt masking been performed.

Based on the information provided, its very likely that I should use the
PCIE_MSI_INTR0_MASK register instead of PCIE_MSI_INTR0_ENABLE on the
dw_pci_bottom_mask() and dw_pci_bottom_unmask() callbacks. As soon as I return
from Linux plumbers conference, I will test the system using the
PCIE_MSI_INTR0_MASK register.

[1]
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?h=v4.20-rc2&id=7c5925afbc58c6d6b384e1dc051bb992969bf787

Is there anything else that has not been clarified yet?

Regards,
Gustavo

> 
> Thanks,
> 
> 	M.
>