On Thu, 2021-05-27 at 21:01 +0200, Borislav Petkov wrote:
> On Thu, May 27, 2021 at 11:09:59AM -0700, Srinivas Pandruvada wrote:
> > My guess is that system is booting hot sometimes. SMM started fan or
> > some cooling and set a temperature threshold. It is waiting for
> > thermal interrupt for temperature threshold, which it never got.
>
> Are you saying that that replication of lvtthmr_init to the APs in
> intel_init_thermal() is absolutely needed on those SMI machines
> running hot?

We have seen some SMM code use thermal interrupts. Several years back we
had an issue on a Yoga system where SMM handling of a thermal interrupt
related to HWP caused a hard hang, because the SMM code crashed there.
So yes, there may be something special for cooling as well.

> That thing:
>
>          * If BIOS takes over the thermal interrupt and sets its interrupt
>          * delivery mode to SMI (not fixed), it restores the value that the
>          * BIOS has programmed on AP based on BSP's info we saved since BIOS
>          * is always setting the same value for all threads/cores.
>
> ?
>
> Me moving that lvtthmr_init read later would replicate the wrong value
> because we'd soft-disable the APIC and thus the core would lockup
> waiting...

I think so. I will try to force replication of a wrong value on the Yoga
system that used to crash in the SMM thermal interrupt handling, and
check what happens. It shouldn't crash, as it will not get the thermal
interrupt. Since the system is not with me, I can try next week.

> The other interesting thing is that the core would always lockup when
> trying to IPI another core to remote-flush the TLBs.

Here I think the other core didn't exit SMM.

Thanks,
Srinivas
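
For readers following the thread, the replication rule in the quoted
comment can be modeled in user space roughly as below. This is only a
sketch, not the kernel code: ap_thermal_lvt() and check_model() are
hypothetical names, the bit values follow the local APIC LVT layout
(delivery mode in bits 10:8, mask in bit 16), and it assumes that an AP's
thermal LVT reads as "masked" after INIT and that a soft-disabled APIC
reports its LVT entries with the mask bit set.

```c
#include <assert.h>
#include <stdint.h>

/* Local APIC LVT entry fields used in this model. */
#define DM_MASK    0x00700u  /* delivery mode field, bits 10:8 */
#define DM_FIXED   0x00000u  /* fixed delivery: handled by the OS */
#define DM_SMI     0x00200u  /* SMI delivery: handled by BIOS/SMM */
#define LVT_MASKED 0x10000u  /* bit 16: interrupt is masked */

/* Assumption: value an AP's thermal LVT holds right after INIT. */
#define AP_RESET_LVTTHMR LVT_MASKED

/*
 * Hypothetical helper: given the thermal LVT value saved from the BSP,
 * return the value the AP ends up with. Only a non-fixed delivery mode
 * (BIOS took over the interrupt) causes the saved value to be copied.
 */
uint32_t ap_thermal_lvt(uint32_t bsp_saved)
{
	if ((bsp_saved & DM_MASK) != DM_FIXED)
		return bsp_saved;        /* replicate BIOS setting */
	return AP_RESET_LVTTHMR;         /* leave the reset value alone */
}

/* Self-check of the cases discussed in the thread. */
void check_model(void)
{
	/* Saved early: SMI delivery, unmasked, is copied unchanged. */
	assert(ap_thermal_lvt(DM_SMI) == DM_SMI);

	/* Saved too late (mask bit already set): the mask is copied too. */
	assert(ap_thermal_lvt(DM_SMI | LVT_MASKED) & LVT_MASKED);

	/* Fixed delivery: nothing is replicated to the AP. */
	assert(ap_thermal_lvt(DM_FIXED) == AP_RESET_LVTTHMR);
}
```

In this model the second case is the "wrong value" scenario: the AP's
thermal LVT stays masked, so SMM on that core would never see the
threshold interrupt it is waiting for.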