On 17/04/2023 11:30 am, Peter Zijlstra wrote:
> On Sat, Apr 15, 2023 at 01:44:13AM +0200, Thomas Gleixner wrote:
>
>> Background
>> ----------
>>
>> The reason why people are interested in parallel bringup is to shorten
>> the (kexec) reboot time of cloud servers to reduce the downtime of the
>> VM tenants. There are obviously other interesting use cases for this
>> like VM startup time, embedded devices...
> ...
>
>> There are two issue there:
>>
>>   a) The death by MCE broadcast problem
>>
>>      Quite some (contemporary) x86 CPU generations are affected by
>>      this:
>>
>>        - MCE can be broadcasted to all CPUs and not only issued locally
>>          to the CPU which triggered it.
>>
>>        - Any CPU which has CR4.MCE == 0, even if it sits in a wait
>>          for INIT/SIPI state, will cause an immediate shutdown of the
>>          machine if a broadcasted MCE is delivered.
> When doing kexec, CR4.MCE should already have been set to 1 by the prior
> kernel, no?

No(ish). Purgatory can't take #MC, or NMIs for that matter.

It's cleaner to explicitly disable CR4.MCE and let the system reset (with
all the MC banks properly preserved), than it is to take #MC while the
IDT isn't in sync with the handlers, and wander off into the weeds.

~Andrew