On 10/01/2010 06:30 PM, Jan Kiszka wrote:
> Hi,
>
> for the past few days I've been trying to understand a very strange hard lock-up of some Intel i7 boxes when running our 16-bit guest OS under KVM. After adding some instrumentation before and after the VM entry (e.g. direct writes to VGA memory), it turned out that the system is apparently stuck inside guest mode!
Strictly speaking, it could also be a crash in the small window between vmexit and your writes. However it's likely to be as you say.
> I double-checked that VM exits on external IRQs and NMIs are properly enabled in the VMCS - they are. I also tried to capture any potential last words (via serial console and even via remote DMA over Firewire) - nothing. This likely means that not only the one core in guest mode is stuck but all the others as well (note: the freeze is reproducible both in UP and SMP mode). Very uncommon for an OS crash I would say...
>
> So I decided to go for some nice conspiracy theory and put SMIs and related BIOS code under suspicion. Interestingly, this worked out: After disabling all SMIs on my box (Fujitsu Celsius H700) via the chipset register, the hard freezes have not occurred again so far. My customer was able to confirm this on a Lenovo notebook as well. We are currently collecting data about the affected systems to correlate it, and we are performing longer test runs.
>
> Nevertheless, I would like to collect some first comments on this. I'm specifically wondering...
>
> - if there is anything the host OS can mess up to make VM exits crash on the way into SMM or out again (I cannot imagine so, as the SMM monitor should always be able to run, at least in the absence of CPU errata).
Yes. It's basically a small hypervisor, and the host OS is its guest. So a well-written SMM handler should not depend on any OS setting. Whether they're actually tested this way is another matter.
> - what the SMM monitor could do wrong to cause such a crash, especially as it looks like the hardware does all the switching for it.
Looks like SMM saves some handler-visible state when EPT is enabled. Are all your failures on EPT-capable hosts? If so, what happens when EPT is disabled?
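To record which EPT setting a given test run actually used, something like the following could help. It is only a sketch: it assumes a Linux host where the kvm_intel module exposes its `ept` parameter under /sys/module/kvm_intel/parameters/ and reports it as Y/N (or 1/0). The helper names `parse_ept_flag` and `ept_enabled` are made up for illustration.

```python
from pathlib import Path

# Assumption: kvm_intel exposes its module parameters at this sysfs path.
EPT_PARAM = Path("/sys/module/kvm_intel/parameters/ept")


def parse_ept_flag(text: str) -> bool:
    """Interpret a kernel bool module parameter value ('Y'/'N' or '1'/'0')."""
    return text.strip().upper() in ("Y", "1")


def ept_enabled(param_path: Path = EPT_PARAM):
    """Return True/False for the EPT setting, or None if it cannot be read
    (for example because kvm_intel is not loaded)."""
    try:
        return parse_ept_flag(param_path.read_text())
    except OSError:
        return None


if __name__ == "__main__":
    state = ept_enabled()
    if state is None:
        print("kvm_intel parameter not readable (module not loaded?)")
    else:
        print("EPT enabled" if state else "EPT disabled")
```

To run with EPT off, reload the module with ept=0 (all guests stopped first) and log the reported state next to each test result.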
> - if there could still be some KVM crash around host<->guest switching that just happens to be triggered by the SMI noise and that affects the whole system (including cores that do not host KVM threads).
>
> Any ideas warmly welcome!
Besides trying with ept=0, I suggest looking for machines that have SMIs but do not crash. If we find them, this seems to indicate a badly written SMM handler. If not, then there may be a systemic problem with kvm (or perhaps all SMM handlers are badly written).
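To tell whether a machine takes SMIs at all, one option is to sample the CPU's SMI counter before and after a test run. The sketch below assumes an Intel CPU that implements MSR_SMI_COUNT (MSR 0x34, Nehalem and later, count in bits 31:0), the Linux msr driver loaded so that /dev/cpu/N/msr exists, and root privileges. The function names are made up for illustration.

```python
import os
import struct
import time

MSR_SMI_COUNT = 0x34  # Assumption: the CPU implements this MSR (Nehalem and later)


def decode_smi_count(raw: bytes) -> int:
    """Decode an 8-byte little-endian MSR value; the SMI count is in bits 31:0."""
    (value,) = struct.unpack("<Q", raw)
    return value & 0xFFFFFFFF


def read_smi_count(cpu: int = 0) -> int:
    """Read the SMI counter of one CPU through the Linux msr driver.
    The MSR number is used as the file offset."""
    fd = os.open(f"/dev/cpu/{cpu}/msr", os.O_RDONLY)
    try:
        return decode_smi_count(os.pread(fd, 8, MSR_SMI_COUNT))
    finally:
        os.close(fd)


def smi_delta(before: int, after: int) -> int:
    """SMIs seen between two samples, allowing for 32-bit wraparound."""
    return (after - before) & 0xFFFFFFFF


if __name__ == "__main__":
    first = read_smi_count(0)
    time.sleep(10)  # run the guest workload during this window
    second = read_smi_count(0)
    print(f"SMIs on CPU 0 in the sampling window: {smi_delta(first, second)}")
```

A machine that shows a non-zero delta while the guest runs and still never locks up would be a candidate for the "has SMIs but does not crash" group.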
--
I have a truly marvellous patch that fixes the bug which this signature is too narrow to contain.