On 02.10.2010 19:25, Avi Kivity wrote:
> On 10/01/2010 06:30 PM, Jan Kiszka wrote:
>> Hi,
>>
>> for the past days I've been trying to understand a very strange hard
>> lock-up of some Intel i7 boxes when running our 16-bit guest OS under
>> KVM. After applying some instrumentation before and after the VM entry
>> (e.g. direct write to VGA memory), it turned out that the system is
>> apparently stuck inside guest mode!
>
> Strictly speaking, it could also be a crash in the small window between
> vmexit and your writes. However it's likely to be as you say.
>
>> I double-checked that VM exits on external IRQs and NMIs are properly
>> enabled in the VMCS - they are. I also tried to capture any potential
>> last words via serial console and even via remote DMA over Firewire -
>> nothing. This likely means that not only the one core in guest mode is
>> stuck but all the others as well (note: the freeze is reproducible both
>> in UP and SMP mode). Very uncommon for an OS crash I would say...
>>
>> So I decided to go for some nice conspiracy theory and put SMIs and
>> related BIOS code under suspect. Interestingly, this worked out:
>>
>> After disabling all SMIs on my box (Fujitsu Celsius H700) via the
>> chipset register, the hard freezes no longer occurred up to now. My
>> customer was able to confirm this on some Lenovo Notebook as well. We
>> are currently collecting data about the affected systems to correlate
>> it, and we are performing longer test runs.
>>
>> Nevertheless, I would like to collect some first comments on this. I'm
>> specifically wondering...
>>
>> - if there is anything the host OS can mess up to make VM exits crash
>> on the way into SMM or out again (I cannot imagine as the SMM monitor
>> should always be able to run, at least in the absence of CPU
>> errata).
>
> Yes. It's basically a small hypervisor, and the host OS is its guest.
> So a well written SMM handler should not depend on any OS setting.
> Whether they're actually tested this way is another matter.
>
>> - what the SMM monitor could do wrong to cause such a crash,
>> especially as it looks like the hardware does all the switching for
>> it.
>
> Looks like SMM saves some handler-visible state when EPT is enabled.
> Are all your failures on EPT-capable hosts? If so, what happens when
> EPT is disabled?

All Core i7 should support EPT, so we should have this enabled on all
affected systems. However, ept=0 makes no difference on my box; it still
locks up.

>
>> - if there could still be some KVM crash around host<->guest switching
>> that just happens to be triggered by the SMI noise and that affects
>> the whole system (including cores that do not host KVM threads).
>>
>> Any ideas warmly welcome!
>
> Besides trying with ept=0, I suggest looking for machines that have SMIs
> but do not crash. If we find them, this seems to indicate a badly
> written SMM handler. If not, then there may be a systemic problem with
> kvm (or perhaps all SMM handlers are badly written).

We are looking into the BIOS vendors. In my case, it is Phoenix, but at
least the Lenovos have been re-branded.

Jan

-- 
Siemens AG, Corporate Technology, CT T DE IT 1
Corporate Competence Center Embedded Linux
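
For finding machines that receive SMIs but do not lock up, one option is to
sample the per-CPU SMI counter. The following is only a minimal sketch and
not something used in this thread. It assumes an Intel CPU of the Nehalem
generation or later (where MSR 0x34, MSR_SMI_COUNT, is available), a loaded
msr driver (modprobe msr), and root privileges. The helper names
decode_smi_count, read_smi_count and ept_enabled are made up for
illustration. It also reports whether kvm_intel currently has EPT enabled.

```python
import os
import struct

MSR_SMI_COUNT = 0x34  # SMI counter MSR on Nehalem and later Intel CPUs
EPT_PARAM = "/sys/module/kvm_intel/parameters/ept"


def decode_smi_count(raw: bytes) -> int:
    """Return the SMI count (low 32 bits) from the 8 raw bytes of the MSR."""
    (value,) = struct.unpack("<Q", raw)
    return value & 0xFFFFFFFF


def read_smi_count(cpu: int = 0) -> int:
    """Read MSR_SMI_COUNT for one CPU through /dev/cpu/<cpu>/msr."""
    fd = os.open(f"/dev/cpu/{cpu}/msr", os.O_RDONLY)
    try:
        # The msr driver uses the file offset as the MSR index.
        return decode_smi_count(os.pread(fd, 8, MSR_SMI_COUNT))
    finally:
        os.close(fd)


def ept_enabled(text: str) -> bool:
    """Interpret the content of the kvm_intel 'ept' module parameter file."""
    return text.strip() in ("Y", "1")


if __name__ == "__main__":
    try:
        print("SMI count on CPU 0:", read_smi_count(0))
    except (OSError, struct.error) as exc:
        print("cannot read MSR_SMI_COUNT:", exc)
    try:
        with open(EPT_PARAM) as f:
            print("kvm_intel ept enabled:", ept_enabled(f.read()))
    except OSError as exc:
        print("cannot read kvm_intel ept parameter:", exc)
```

Reading the counter before and after a guest run shows whether SMIs arrived
while the guest was executing. A box whose count increases without a freeze
would be a candidate for the "has SMIs but does not crash" group.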