Re: [PATCH v2] PCI/MSI: Avoid torn updates to MSI pairs

Evan Green <evgreen@xxxxxxxxxxxx> · Thu, 23 Jan 2020 16:29:34 -0800

On Thu, Jan 23, 2020 at 2:59 PM Evan Green <evgreen@xxxxxxxxxxxx> wrote:
>
> On Thu, Jan 23, 2020 at 12:59 PM Evan Green <evgreen@xxxxxxxxxxxx> wrote:
> >
> > On Thu, Jan 23, 2020 at 10:17 AM Thomas Gleixner <tglx@xxxxxxxxxxxxx> wrote:
> > >
> > > Evan,
> > >
> > > Thomas Gleixner <tglx@xxxxxxxxxxxxx> writes:
> > > > This is not yet debugged fully and as this is happening on MSI-X I'm not
> > > > really convinced yet that your 'torn write' theory holds.
> > >
> > > can you please apply the debug patch below and run your test. When the
> > > failure happens, stop the tracer and collect the trace.
> > >
> > > Another question. Did you ever try to change the affinity of that
> > > interrupt without hotplug rapidly while the device makes traffic? If
> > > not, it would be interesting whether this leads to a failure as well.
> >
> > Thanks for the patch. Looks pretty familiar :)
> > I ran into issues where trace_printks on offlined cores seem to
> > disappear. I even made sure the cores were back online when I
> > collected the trace. So your logs might not be useful. Known issue
> > with the tracer?
> >
> > I figured I'd share my own debug chicken scratch, in case you could
> > glean anything from it. The LOG entries print out timestamps (divide
> > by 1000000) that you can match up back to earlier in the log (ie so
> > the last XHCI MSI change occurred at 74.032501, the last interrupt
> > came in at 74.032405). Forgive the mess.
> >
> > I also tried changing the affinity rapidly without CPU hotplug, but
> > didn't see the issue, at least not in the few minutes I waited
> > (normally repros easily within 1 minute). An interesting datapoint.
>
> One additional datapoint. The intel guys suggested enabling
> CONFIG_IRQ_REMAP, which does seem to eliminate the issue for me. I'm
> still hoping there's a smaller fix so I don't have to add all that in.

I did another experiment that I think lends credibility to my torn MSI
hypothesis. I have the following change:

diff --git a/arch/x86/kernel/cpu/mcheck/mce.c b/arch/x86/kernel/cpu/mcheck/mce.c
index 1f69b12d5bb86..0336d23f9ba9a 100644
--- a/arch/x86/kernel/cpu/mcheck/mce.c
+++ b/arch/x86/kernel/cpu/mcheck/mce.c
@@ -1798,6 +1798,7 @@ void (*machine_check_vector)(struct pt_regs *,
long error_code) =

 dotraplinkage void do_mce(struct pt_regs *regs, long error_code)
 {
+printk("EVAN MACHINE CHECK HC died");
        machine_check_vector(regs, error_code);
 }

diff --git a/drivers/pci/msi.c b/drivers/pci/msi.c
index 23a363fd4c59c..31f683da857e3 100644
--- a/drivers/pci/msi.c
+++ b/drivers/pci/msi.c
@@ -315,6 +315,11 @@ void __pci_write_msi_msg(struct msi_desc *entry,
struct msi_msg *msg)
                msgctl |= entry->msi_attrib.multiple << 4;
                pci_write_config_word(dev, pos + PCI_MSI_FLAGS, msgctl);

+if (entry->msi_attrib.is_64) {
+pci_write_config_word(dev, pos + PCI_MSI_DATA_64, 0x4012);
+} else {
+pci_write_config_word(dev, pos + PCI_MSI_DATA_32, 0x4012);
+}
                pci_write_config_dword(dev, pos + PCI_MSI_ADDRESS_LO,
                                       msg->address_lo);
                if (entry->msi_attrib.is_64) {

And indeed, I get a machine check, despite the fact that MSI_DATA is
overwritten just after address is updated.

[   79.937179] smpboot: CPU 1 is now offline
[   80.001685] smpboot: CPU 3 is now offline
[   80.025210] smpboot: CPU 5 is now offline
[   80.049517] smpboot: CPU 7 is now offline
[   80.094263] x86: Booting SMP configuration:
[   80.099394] smpboot: Booting Node 0 Processor 1 APIC 0x1
[   80.136233] smpboot: Booting Node 0 Processor 3 APIC 0x3
[   80.155732] smpboot: Booting Node 0 Processor 5 APIC 0x5
[   80.173632] smpboot: Booting Node 0 Processor 7 APIC 0x7
[   80.297198] smpboot: CPU 1 is now offline
[   80.331347] EVAN MACHINE CHECK HC died
[   82.281555] Kernel panic - not syncing: Timeout: Not all CPUs
entered broadcast exception handler
[   82.295775] Kernel Offset: disabled
[   82.301740] gsmi: Log Shutdown Reason 0x02
[   82.313942] Rebooting in 30 seconds..
[  112.204113] ACPI MEMORY or I/O RESET_REG.

-Evan