Marc Zyngier <marc.zyngier@xxxxxxx> writes:

> On Tue, 01 May 2018 14:25:54 +0100,
> Bjorn Helgaas wrote:
>
> Hi Bjorn,
>
>> On Tue, May 01, 2018 at 01:59:20PM +0100, Marc Zyngier wrote:
>> > On 01/05/18 13:38, Sinan Kaya wrote:
>> > > +Marc,
>> > >
>> > > On 4/30/2018 5:27 PM, Sinan Kaya wrote:
>> > >> On 4/30/2018 5:17 PM, Bjorn Helgaas wrote:
>> > >>>> What should we do about this?
>> > >>>>
>> > >>>> Since there is an actual HW errata involved, should we quirk this
>> > >>>> root port and not wait as if remove/shutdown doesn't exist?
>> > >>> I was hoping to avoid a quirk because AFAIK all Intel parts have this
>> > >>> issue so it will be an ongoing maintenance issue. I tried to avoid
>> > >>> the timeout delays, e.g., with 40b960831cfa ("PCI: pciehp: Compute
>> > >>> timeout from hotplug command start time").
>> > >>>
>> > >>> But we still see the alarming messages, so we should probably add a
>> > >>> quirk to get rid of those.
>> > >>>
>> > >>> But I haven't given up on the idea of getting rid of the
>> > >>> pciehp_remove() path. I'm not convinced yet that we actually need to
>> > >>> do anything to shut this device down. I don't like the assumption
>> > >>> that kexec requires this. The kexec is fundamentally just a branch,
>> > >>> and anything we do before the branch (i.e., in the old kernel), we
>> > >>> should also be able to do after the branch (i.e., in the kexec-ed
>> > >>> kernel).
>> > >>>
>> > >>
>> > >> In my experience with kexec, MSI type edge interrupts are harmless.
>> > >> You might just see a few unhandled interrupt messages during boot
>> > >> if something is pending from the first kernel.
>> >
>> > Unfortunately, that's not always the case.
>> >
>> > A number of GICv3/v4 implementations (a very common interrupt controller
>> > on ARM servers) cannot be disabled, which means they will keep writing
>> > to their pending tables long after kexec will have started the new
>> > kernel.
>> > And since we don't track memory allocation across kexec, you
>> > end-up with significant chances of observing single bit corruption as
>> > interrupts carry on being delivered. Oh, and you won't actually be able
>> > to take MSIs because you can't even reprogram the damn thing.
>> >
>> > Yes, this can be considered a HW bug.
>> >
>> > >> It is the level interrupts that are more concerning. It remains pending
>> > >> until the interrupt source is cleared. CPU never returns from the
>> > >> interrupt handler to actually continue booting the second kernel.
>> > >
>> > > This makes me wonder why kexec doesn't disable all interrupt sources by
>> > > itself instead of relying on the drivers shutdown routine. Some drivers
>> > > don't even have a shutdown callback. Kexec could have done both as another
>> > > example. Something like.
>> > >
>> > > 1. Call shutdown for all drivers if available.
>> > > 2. Disable all interrupt sources in the interrupt controller
>> > > 3. Start the new kernel.
>> >
>> > See above. Although you can shut off the end-point and to some extent
>> > mask interrupts before jumping into the payload, it is not always
>> > possible to go back to a reasonable state where you can take actually MSIs.
>>
>> This is exactly the sort of thing it would be nice to collect and
>> document as part of the background of "why kexec works the way it
>> does." It certainly helps explain things that are far from obvious if
>> you don't have the background.
>
> I'd certainly be happy to help with it if someone was willing to
> kickstart such a document. kexec/kdump is a huge bag of "interesting"
> tricks, and it has driven me mad over the past couple of months (I'm
> typing this from a laptop that uses kexec as its bootloader, and it is
> *not fun*).

I don't know if it helps documentation wise but here is my memory of why
things are the way they are.

Case 1) kexec-on-panic.
In this case we run the new kernel in memory that was reserved at boot
of the previous kernel and has never been used by any device driver.
This means ongoing DMA transactions that we don't manage to shut off
are harmless.

In actual execution a bare minimum of hardware is shut down on the
kexec-on-panic path. Ideally it would be nothing. The crashing kernel
simply cannot be trusted to shut things down itself.

The kernel that executes after the crash loads a bare minimum of
drivers and does its best to initialize the hardware. Ideally if
something goes wrong the kernel will hang before we write to hardware
and mess anything up.

With this we get something like a 50% or 60% success rate of capturing
a crashdump in practice in the field. Everything else that has been
tried relies more on the crashing kernel, looks great in testing, and
then turns out not to have a measurable success rate in practice.

Using lkdtm you can set up tests of various kinds of kernel corruption
and failure and see some approximation of the success rate
kexec-on-panic will see in practice.

I forget where we are with iommus, but the principles remain, and
iommus tend to be tricky just because they get in the middle of
everything.

If someone stares hard enough we are probably at the point on x86 where
we can remove the irq shutdown code.

The kexec-on-panic case tends to be tested more on enterprise kernels
than on normal ones.

Case 2) Ordinary kexec.

The goal is to have a fully functional, uncompromised system (unlike
kexec-on-panic).

Hardware bugs mean that in the general case the only place we can shut
down hardware reliably is the drivers themselves.

All devices doing DMA must be shut down in the kexec'ing kernel, in
part because there is no guarantee that we will even load a driver for
that hardware.

The presence of DMA drove most of the decisions. But from this thread
I see that irq handling follows the same pattern.
The best place to shut anything down is in the driver, where there is
full knowledge of how things work.

One of the more annoying things that has been discovered is that the
generic PCI DMA disable bit doesn't work uniformly across hardware,
which means there is no known generic way to shut down DMA across the
board.

In the prototypes there was only the "remove" method of drivers, and
that worked well. When it came time to merge the original kexec
implementation, the maintainer of the power management subsystem
insisted we add a new "shutdown" method instead, because while it is
necessary to shut down the hardware you should not need to clean up the
data structures. In practice that idea flopped.

The most reliable way I know to run kexec is to rmmod all of the
drivers before running sys_reboot(..., LINUX_REBOOT_CMD_KEXEC, ...) so
that the shutdown methods get run.

It has been asked, and I have given my approval to anyone who wants to
do the work, to switch from the "shutdown" methods to "remove" on the
kexec path. But so far it is a big enough project that no one has done
it yet.

It has been suggested that hardware does not need to be shut down at
the end of the kernel before returning to a firmware method. That is
incorrect. Most firmware, when it regains control, triggers a system
reset to get the hardware back into a usable state and be able to
reboot the system. There is a magic register for this on x86. Older
x86 systems and others transfer control to firmware without doing a
soft hardware reset of the system and all of the devices. On those,
without shutting down the devices, things will work about as well as
kexec does when you don't remove the devices.

That is why I merged the reboot and the kexec code paths. Well, that
and so that there is a little more testing. In practice it still seems
that rmmod is the only testing that reliably happens to drivers. So
not sharing that code path makes kexec more fragile than necessary.
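
As a rough illustration of the "shutdown" versus "remove" point above,
here is a small userspace C model. It is only a sketch: struct
model_dev and every model_* function are hypothetical names made up for
this example, not the kernel's driver-model API, and "DMA active" is
just a flag. It shows why a path that relies on optional shutdown
callbacks can leave a device running, while a path that uses remove
does not.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical userspace model of a device; NOT the kernel's struct device. */
struct model_dev {
    const char *name;
    bool dma_active;       /* device may still write to memory */
    bool state_allocated;  /* driver data structures still exist */
    void (*remove)(struct model_dev *d);    /* full teardown, as on rmmod */
    void (*shutdown)(struct model_dev *d);  /* quiesce only; may be NULL */
};

/* "shutdown"-style callback: stop the hardware, keep data structures. */
void model_quiesce(struct model_dev *d)
{
    d->dma_active = false;
}

/* "remove"-style callback: stop the hardware and drop driver state. */
void model_remove(struct model_dev *d)
{
    model_quiesce(d);
    d->state_allocated = false;
}

/*
 * Model of a kexec path that relies on shutdown callbacks only.
 * Returns how many devices are still doing DMA afterwards.
 */
size_t model_kexec_via_shutdown(struct model_dev *devs, size_t n)
{
    size_t still_active = 0;

    assert(devs != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        if (devs[i].shutdown)          /* some drivers have none */
            devs[i].shutdown(&devs[i]);
        if (devs[i].dma_active)
            still_active++;
    }
    return still_active;
}

/*
 * Model of a kexec path that uses remove callbacks instead.
 * Returns how many devices are still doing DMA afterwards.
 */
size_t model_kexec_via_remove(struct model_dev *devs, size_t n)
{
    size_t still_active = 0;

    assert(devs != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        if (devs[i].remove)
            devs[i].remove(&devs[i]);
        if (devs[i].dma_active)
            still_active++;
    }
    return still_active;
}
```

Given two modelled devices that both start with DMA active, one with a
shutdown callback and one without, model_kexec_via_shutdown() reports
one device still doing DMA, while model_kexec_via_remove() reports
none.
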
Hopefully this helps put things into perspective and can help with your
document.

Eric