Re: [RFC PATCH] PCI, kdump: Clear bus master bit upon shutdown in kdump kernel

Dave Young <dyoung@xxxxxxxxxx> · Fri, 17 Jan 2020 11:24:13 +0800

On 01/15/20 at 02:17pm, Khalid Aziz wrote:
> On 1/15/20 11:05 AM, Kairui Song wrote:
> > On Thu, Jan 16, 2020 at 1:31 AM Khalid Aziz <khalid@xxxxxxxxxxxxxx> wrote:
> >>
> >> On 1/13/20 10:07 AM, Kairui Song wrote:
> >>> On Sun, Jan 12, 2020 at 2:33 AM Deepa Dinamani <deepa.kernel@xxxxxxxxx> wrote:
> >>>>
> >>>>> Hi, there is some previous work on this issue: resetting PCI devices
> >>>>> in the kdump kernel to stop ongoing DMA:
> >>>>>
> >>>>> [v7,0/5] Reset PCIe devices to address DMA problem on kdump with iommu
> >>>>> https://lore.kernel.org/patchwork/cover/343767/
> >>>>>
> >>>>> [v2] PCI: Reset PCIe devices to stop ongoing DMA
> >>>>> https://lore.kernel.org/patchwork/patch/379191/
> >>>>>
> >>>>> Those didn't get merged. The patches tried to fix a DMAR error
> >>>>> problem, but resetting devices is a bit too destructive, and the
> >>>>> problem was later fixed on the IOMMU side. In most cases the DMA
> >>>>> seems harmless, as it targets the first kernel's memory and the
> >>>>> kdump kernel only lives in crash memory.
> >>>>
> >>>> I was going to ask the same. If the kdump kernel had IOMMU on, would
> >>>> that still be a problem?
> >>>
> >>> It will still fail. Doing DMA is not a problem; it only goes wrong when
> >>> a device's upstream bridge is mistakenly shut down before the device
> >>> itself is shut down.
> >>>
> >>>>
> >>>>> Also, by the time the kdump kernel is able to scan and reset devices,
> >>>>> there is already a very large time window where things could go
> >>>>> wrong.
> >>>>>
> >>>>> The problem currently observed only happens upon kdump kernel
> >>>>> shutdown, as the upstream bridge is disabled before the device is
> >>>>> disabled, so DMA will raise an error. It's more like a problem of
> >>>>> wrong device shutdown order.
> >>>>
> >>>> The way it was described earlier "During this time, the SUT sometimes
> >>>> gets a PCI error that raises an NMI." suggests that it isn't really
> >>>> restricted to kexec/kdump.
> >>>> Any attached device without an active driver might attempt spurious or
> >>>> malicious DMA and trigger the same during normal operation.
> >>>> Do you have any more reporting available on what happens during the
> >>>> PCIe error handling?
> >>>
> >>> Let me add more info about this:
> >>>
> >>> On the machine where I can reproduce this issue, the first kernel
> >>> always runs fine, and the kdump kernel works fine while dumping the
> >>> vmcore. Even if I keep the kdump kernel running for hours, nothing
> >>> goes wrong. If DMA during normal operation could cause a problem,
> >>> this should have exposed it.
> >>>
> >>
> >> This is the part that is puzzling me. The error shows up only when the
> >> kdump kernel is being shut down. The kdump kernel can run for hours
> >> without this issue. What is the operation from the downstream device
> >> that is resulting in an uncorrectable error - is it indeed a DMA
> >> request? Why does that operation from the downstream device not happen
> >> until shutdown?
> >>
> >> I just want to make sure we fix the right problem in the right way.
> >>
> > 
> > Actually the device can keep sending requests with no problem while
> > the kdump kernel is running, e.g. it keeps doing DMA, and all DMA
> > targets the first kernel's system memory, so kdump runs fine as long
> > as nothing touches the reserved crash memory. The error is reported
> > by the port: once it is shut down its bus master bit is cleared, and
> > a downstream request will then cause an error.
> > 
> 
> The real problem is that there are active devices while the kdump kernel
> is running. You did say earlier - "In most cases the DMA seems
> harmless, as it targets the first kernel's memory and the kdump kernel
> only lives in crash memory.". Even if this holds today, it is going to
> break one of these days. There is the "reset_devices" option but that
> does not work if the driver is not loaded by the kdump kernel. Can we try
> to shut down devices in machine_crash_shutdown() before we start the
> kdump kernel?

It is not a good idea :)  We do not add extra logic after a panic
because the kernel is not stable and we want a correct vmcore.

Similar suggestions have been rejected many times.

Thanks
Dave
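
For readers following the ordering problem described in the quoted text
(an upstream bridge loses its bus master bit while a device below it can
still issue DMA), here is a minimal user-space sketch in C. It is only a
model, not kernel code: struct model_dev, model_clear_master() and
model_dma_would_error() are hypothetical names invented for illustration.
Only the PCI_COMMAND_MASTER value (bit 2 of the PCI command register) is
taken from the PCI specification.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Bus Master Enable bit of the PCI command register (value per PCI spec). */
#define PCI_COMMAND_MASTER 0x4

/* Hypothetical model of a PCI function; NOT the kernel's struct pci_dev. */
struct model_dev {
    uint16_t command;          /* simulated PCI command register */
    struct model_dev *parent;  /* upstream bridge, NULL at the root */
};

/* Model of clearing the bus master bit, as a shutdown path might do. */
static void model_clear_master(struct model_dev *dev)
{
    dev->command &= (uint16_t)~PCI_COMMAND_MASTER;
}

/*
 * Returns true if a DMA issued by dev would reach an upstream bridge whose
 * bus master bit is already cleared (assumed here to be the error case).
 */
static bool model_dma_would_error(const struct model_dev *dev)
{
    const struct model_dev *bridge;

    if (!(dev->command & PCI_COMMAND_MASTER))
        return false; /* the device cannot issue DMA at all */

    for (bridge = dev->parent; bridge != NULL; bridge = bridge->parent) {
        if (!(bridge->command & PCI_COMMAND_MASTER))
            return true; /* bridge no longer forwards upstream requests */
    }
    return false;
}
```

In this model, clearing the bit on the bridge first leaves the device able
to trigger an error, while clearing it on the device first (or on both)
does not. Whether real hardware reports the error the same way is outside
what the sketch can show.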