On 1/4/19 4:00 PM, Stephen Warren wrote:
On 1/4/19 3:45 PM, Trent Piepho wrote:
On Fri, 2019-01-04 at 14:45 -0700, Stephen Warren wrote:
From: Stephen Warren <swarren@xxxxxxxxxx>
The current code does this when handling MSI IRQs:
a) Process the irq.
b) Clear the latched IRQ status.
If a new IRQ occurs any time after (a) has read the IRQ status for the
last time and before (b), it will be lost. For example, this occurs in
There was a patch series to fix this bug, sent in October, stretching
to December.
See "PCI: dwc: Fix interrupt race in when handling MSI"
And "PCI: designware: Move interrupt acking into the proper callback"
The result was more code changes, which in the end produce the same
order of operations as is done here to fix the race. But it uses the
irq framework correctly.
I do not think this problem was introduced until 4.14, so the nvidia
4.9 must have additional patches if this is observed there. I'm sure
it was not present in 4.12, as we used that, and did not see the issue
until updating to 4.16.
Yes, I took the DWC driver from our 4.14 kernel and dropped it into the
4.9 kernel, so that's why the issue shows up in 4.9 for us.
The current fix (for 4.21?) depends on significant restructuring that
was done to this driver around 4.17, see "PCI: dwc: Move MSI IRQs
allocation to IRQ domains hierarchical API".
I think my original patch ("Fix interrupt race ..") or your patch here,
which is basically the same, should be ported to 4.14 stable. But
there are some other opinions on that. It's not clear to me if the
hierarchical domain stuff would be back ported to stable series. It
does not seem to me to be a good idea to do so.
Hmm. Well I guess I'll go for the patch I posted in our downstream
kernels, since back-porting a bunch of not-yet-available restructuring
to our ancient kernels doesn't sound pleasant:-) But I'll go and take a
quick look at the other patches you mentioned just in case. Thanks!
So I went and read the thread for "PCI: dwc: Fix interrupt race in when
handling MSI" and see:
Vignesh R wrote:
> Lorenzo Pieralisi wrote:
AFAICS:
8c934095fa2f ("PCI: dwc: Clear MSI interrupt status after it is handled,
not before")
was fixing a bug, causing "timeouts on some wireless lan cards", we want
to understand what the problem is, fix it once for all on all DWC
based systems.
That issue was root caused to be due to a HW errata in dra7xx DWC
wrapper which requires a special way of handling MSI interrupts at
wrapper level. More info in this thread:
https://www.spinics.net/lists/linux-pci/msg70462.html
Unfortunately, commit 8c934095fa2f did not fix WLAN issue in longer
tests and also broke PCIe USB cards. Therefore, it makes sense to revert
8c934095fa2f
I am working on patches fix dra7xx wrapper for WLAN card issue.
So it seems quite clear that the correct course of action is to
immediately revert commit 8c934095fa2f ("PCI: dwc: Clear MSI interrupt
status after it is handled, not before") (or apply my or Trent's patches
which are effectively reverts) since it (a) attempts to fix a bug (in
the core DWC driver) that doesn't actually exist (the bug is in the DRA
HW and is being fixed in the DRA wrapper driver), and (b) has many
reports that the change causes regressions; I'm at least the 3rd or 4th
person to confirm this now (Faiz originally, then later Trent, Vignesh,
myself).
Now, whether there's some cleanup or additional fixes needed beyond the
simple revert is another question I currently have no insight into, but
let's at least get back to a driver without that worked well for people
for years, even if there's a theoretical issue to be fixed that nobody
hit in practice.