On Tue, Mar 13, 2012 at 09:33:50AM -0400, Don Zickus wrote:
> On Tue, Mar 13, 2012 at 11:11:49AM +0900, Fernando Luis Vázquez Cao wrote:
> > On 03/13/2012 05:16 AM, H. Peter Anvin wrote:
> > >On 03/12/2012 01:04 PM, H. Peter Anvin wrote:
> > >>On 03/12/2012 01:01 PM, Eric W. Biederman wrote:
> > >>>The basic problem is which source do we block this at? How many
> > >>>sources are there? And architecturally last I looked x86 no longer
> > >>>has a NMI disable. EFI and similar systems want to get away without
> > >>>a CMOS legacy clock because designers so often get them wrong.
> > >>>
> > >>On all processors which have an LAPIC you can block all NMI sources at
> > >>the LAPIC. I think it's safe to assume that if you don't have an LAPIC
> > >>-- an ancient system by now -- you have port 70h.
> > >>
> > >One thing: *disabling* the LAPIC will allow external NMIs coming in on
> > >LINT1 through, since the LAPIC in the disabled state tries to mimic the
> > >no-LAPIC configuration. So I don't think you want to disable LAPIC as
> > >much as disable the interrupt vectors within.
> >
> > Does this sound like a plan to get the ball rolling?:
> >
> > 1.- Merge Don's patch to disable the LAPIC in kdump reboot path (this
> > fixes a real issue seen in the field, is a net win and certainly not a
> > regression - indeed it makes the code simpler because the I/O
> > APICs are left untouched).
>
> I think you mean my patch to stop disabling the I/O APIC. That patch
> hasn't seen any new issues. It was the piece that stopped disabling the
> LAPIC that opened the doors for NMIs to fault the system.
>
> >
> > 2.- Merge my patch set to ignore early NMIs (this brings the behavior
> > of the boot code in line with what we do in the rest of the kernel
> > and we can avoid situations where a spurious NMI causes the kernel
> > to halt). The early NMI handler is temporary and the final NMI
> > handler installed shortly afterwards will take care of subsequent
> > NMIs.
> >
> > 3.- Make sure that spurious NMIs (i.e. NMIs that for whatever reason
> > could not be stopped at the source) received during the reboot
> > path to the kdump kernel do not cause a triple fault or a system
> > lockup. This is under testing.
>
> This will require changes in kexec-tools as the purgatory code zaps the
> GDT I believe. This is going to make a 'complete solution' dependent on
> a version of kexec-tools. Not sure what we want to do there.

Ouch. I guess that in the event that purgatory needs to be modified
some backwards-compatibility will need to be provided, possibly by
allowing purgatory to switch between two behaviours based on e.g. a
command line parameter.

> > 4.- Identify all the NMI sources and keep them from reaching the CPU
> > when it can be done in a race-free way.