Re: VMX: System lock-up in guest mode, BIOS under suspect

 On 10/01/2010 06:30 PM, Jan Kiszka wrote:
Hi,

for the past few days I've been trying to understand a very strange hard
lock-up of some Intel i7 boxes when running our 16-bit guest OS under
KVM. After applying some instrumentation before and after the VM entry
(e.g. direct write to VGA memory), it turned out that the system is
apparently stuck inside guest mode!

Strictly speaking, it could also be a crash in the small window between the VM exit and your writes. However, it's likely to be as you say.

I double-checked that VM exits on external IRQs and NMIs are properly
enabled in the VMCS - they are. I also tried to capture any potential
last words via serial console (and even via remote DMA over FireWire) -
nothing. This likely means that not only the one core in guest mode is
stuck but all the others as well (note: the freeze is reproducible both
in UP and SMP mode). Very uncommon for an OS crash I would say...
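
For anyone repeating that VMCS check, a minimal sketch of decoding the two relevant bits of the pin-based VM-execution controls. The bit positions are the ones given in the Intel SDM; the function name and the idea of feeding it a raw value dumped from the VMCS are assumptions for illustration, not something taken from the KVM sources. Note that SMIs are not governed by either bit.

```python
# Pin-based VM-execution control bits (Intel SDM, Vol. 3).
PIN_BASED_EXT_INTR_EXITING = 1 << 0  # external interrupts cause VM exits
PIN_BASED_NMI_EXITING = 1 << 3       # NMIs cause VM exits


def exits_on_irq_and_nmi(pin_based_controls: int) -> dict:
    """Report whether external-interrupt and NMI exiting are enabled.

    `pin_based_controls` is assumed to be the raw 32-bit value read from
    the VMCS by some instrumentation of your own (hypothetical here).
    """
    return {
        "external_interrupt_exiting": bool(
            pin_based_controls & PIN_BASED_EXT_INTR_EXITING
        ),
        "nmi_exiting": bool(pin_based_controls & PIN_BASED_NMI_EXITING),
    }


if __name__ == "__main__":
    # Example value: the default-1 bits (0x16) plus both exiting controls.
    print(exits_on_irq_and_nmi(0x16 | 0x9))
```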

So I decided to go for some nice conspiracy theory and put SMIs and
related BIOS code under suspect. Interestingly, this worked out:

After disabling all SMIs on my box (Fujitsu Celsius H700) via the
chipset register, the hard freezes have not occurred again so far. My
customer was able to confirm this on a Lenovo notebook as well. We
are currently collecting data about the affected systems to correlate
it, and we are performing longer test runs.
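
The mail does not say which chipset register was used. As a rough sketch only, assuming an Intel ICH/PCH-style layout (PMBASE in the LPC bridge configuration space, SMI_EN at PMBASE + 0x30, GBL_SMI_EN as bit 0), the arithmetic would look like this. The helper names are made up, and the code deliberately performs no port I/O; it only computes the port number and the value one would write back. On systems where the BIOS has locked the SMI configuration, such a write may have no effect.

```python
# ASSUMPTION: Intel ICH/PCH-style power management register layout.
PMBASE_ADDR_MASK = 0xFF80  # bits 15:7 of PMBASE hold the I/O base address
SMI_EN_OFFSET = 0x30       # SMI_EN register offset from PMBASE
GBL_SMI_EN = 1 << 0        # global SMI enable bit in SMI_EN


def smi_en_port(pmbase_cfg: int) -> int:
    """Derive the SMI_EN I/O port from the raw PMBASE config value."""
    return (pmbase_cfg & PMBASE_ADDR_MASK) + SMI_EN_OFFSET


def without_global_smi(smi_en: int) -> int:
    """Return the SMI_EN value with the global enable bit cleared."""
    return smi_en & ~GBL_SMI_EN & 0xFFFFFFFF


if __name__ == "__main__":
    # Hypothetical readings, for illustration only.
    print(hex(smi_en_port(0x0401)), hex(without_global_smi(0x2033)))
```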

Nevertheless, I would like to collect some first comments on this. I'm
specifically wondering...

  - if there is anything the host OS can mess up to make VM exits crash
    on the way into SMM or out again (I cannot imagine how, as the SMM
    monitor should always be able to run, at least in the absence of CPU
    errata).

Yes. It's basically a small hypervisor, and the host OS is its guest. So a well written SMM handler should not depend on any OS setting. Whether they're actually tested this way is another matter.

  - what the SMM monitor could do wrong to cause such a crash,
    especially as it looks like the hardware does all the switching for
    it.

Looks like SMM saves some handler-visible state when EPT is enabled. Are all your failures on EPT-capable hosts? If so, what happens when EPT is disabled?

  - if there could still be some KVM crash around host<->guest switching
    that just happens to be triggered by the SMI noise and that affects
    the whole system (including cores that do not host KVM threads).

Any ideas warmly welcome!

Besides trying with ept=0, I suggest looking for machines that have SMIs but do not crash. If we find them, that would point to a badly written SMM handler on the machines that do crash. If not, then there may be a systemic problem with KVM (or perhaps all SMM handlers are badly written).
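
When collecting data from several machines, it helps to record whether EPT was actually off for a given run. A small sketch, assuming the usual sysfs location of the kvm_intel `ept` module parameter; the function names are made up for illustration.

```python
from pathlib import Path
from typing import Optional

# Usual sysfs location of the kvm_intel "ept" module parameter.
EPT_PARAM = Path("/sys/module/kvm_intel/parameters/ept")


def parse_kvm_bool(text: str) -> bool:
    """Interpret a boolean module parameter as shown in sysfs (Y/N or 1/0)."""
    return text.strip() in ("Y", "y", "1")


def ept_enabled(param_path: Path = EPT_PARAM) -> Optional[bool]:
    """Return True/False for the current EPT setting, or None if the
    parameter file cannot be read (e.g. kvm_intel is not loaded)."""
    try:
        return parse_kvm_bool(param_path.read_text())
    except OSError:
        return None


if __name__ == "__main__":
    print("EPT enabled:", ept_enabled())
```

To test with EPT off, reload the module with `modprobe -r kvm_intel` followed by `modprobe kvm_intel ept=0` (no guests may be running at that point).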

--
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.
