On Fri, Feb 10, 2012 at 08:03:53PM +0100, Peter Zijlstra wrote:
> On Fri, 2012-02-10 at 19:58 +0100, Peter Zijlstra wrote:
> > OK, so a 'modern' kernel does it slightly different and I've no idea
> > what exactly goes wrong in your vintage version. But I can see the
> > current stuff going at it all wrong.
> >
> > What seems to happen is that native_nmi_stop_other_cpus() NMI broadcasts
> > for smp_stop_nmi_callback()->stop_this_cpu(). Which without any
> > serialization what so ever marks all remote CPUs offline and calls halt
> > with IRQs disabled -> dead.
> >
> > While we're waiting for this all to complete, the scheduler tries to
> > no_hz load-balance and kick a cpu it thinks is still around and we get
> > the above splat because the NMI just marked it offline without telling
> > anybody about it.
> >
> > Now, arguably you don't want to go through the whole hotplug crap to
> > shut down your machine, esp not on panic, but clearing the online state
> > without telling anybody about it is bound to lead to these things.
> >
> > No immediate solution comes to mind...
>
> Don, any reason you wait for the NMI broadcast to complete with IRQs
> enabled? If you disable IRQs before the broadcast the interrupt can't
> happen and should side-step this particular problem.

Well, I believe the old way had the same problem using the REBOOT_IRQ as
opposed to NMI. I also don't know how to shut down interrupts system-wide
without just broadcasting an IRQ to locally disable interrupts.

>
> Its not like we have 'latency' issues on this path :-)

Heh. Oddly, I was writing the changelog for a patch that kinda changes this
path to sorta revert back to the old way of using a REBOOT_IRQ with an NMI
follow-on when the IRQ fails. Originally, I wanted to make sure the cpus
were shut down immediately so we can serialize the panic path, hence the
original change.

I also ran into the same problem you did and hacked up another patch that
checked a global atomic variable that let the system know we were shutting
down and not to do the WARN_ON (the global is already created for the NMI
case now). I'll try to post that soon once I finish my long-winded
changelog.

Though it kinda addresses your issue, I'm not sure it does it in a way that
will satisfy you. But I look forward to the discussion. :-)

Cheers,
Don
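
For illustration only, here is a minimal sketch of the flag-check idea
described above, written as userspace C11 rather than kernel code. It is
not the actual patch: `shutting_down`, `cpu_online`, `kick_cpu()` and
`stop_other_cpus()` are hypothetical stand-ins, and the real kernel would
use its own atomic and cpumask primitives.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>

#define NR_CPUS 4

/* Hypothetical global flag: set once the shutdown/panic path begins. */
static atomic_bool shutting_down;

/* Hypothetical per-CPU online state, standing in for the online mask. */
static atomic_bool cpu_online[NR_CPUS];

/* Counts how many warnings were emitted (stands in for WARN_ON). */
static int warnings;

/* Models the scheduler kicking a remote CPU it believes is online. */
static bool kick_cpu(int cpu)
{
	if (!atomic_load(&cpu_online[cpu])) {
		/* Only warn if the CPU vanished outside of a shutdown. */
		if (!atomic_load(&shutting_down)) {
			warnings++;
			fprintf(stderr, "WARN: kicking offline cpu %d\n", cpu);
		}
		return false;
	}
	return true;
}

/* Models the stop path: announce shutdown first, then mark CPUs offline. */
static void stop_other_cpus(int self)
{
	atomic_store(&shutting_down, true);
	for (int cpu = 0; cpu < NR_CPUS; cpu++)
		if (cpu != self)
			atomic_store(&cpu_online[cpu], false);
}
```

The ordering in `stop_other_cpus()` is the point of the sketch: the flag is
set before any CPU is marked offline, so a late `kick_cpu()` sees the
shutdown in progress and skips the warning. This only silences the splat;
it does not add the serialization discussed earlier in the thread.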