Re: Hosts got stuck with vmx: unexpected exit reason 0x3

Sean Christopherson <seanjc@xxxxxxxxxx> · Mon, 1 Apr 2024 13:48:11 -0700

On Thu, Mar 28, 2024, jiang.kun2@xxxxxxxxxx wrote:
> Dear KVM experts,
> 
> We have two hosts that got stuck, and the last serial port logs had
> kvm prints vmx: unexpected exit reason 0x3.
> 
> last logs of HostA:
> [23031085.916249] kvm [9737]: vcpu6, guest rIP: 0xffffffffb190d1b5 vmx: unexpected exit reason 0x3
> [23031085.916251] set kvm_intel.dump_invalid_vmcs=1 to dump internal KVM state.
> 
> last logs of HostB:
> [16755112.797211] kvm [2787303]: vcpu11, guest rIP: 0x70a8f4 vmx: unexpected exit reason 0x3
> [16755112.797213] kvm [2787303]: vcpu16, guest rIP: 0x70a9ae vmx: unexpected exit reason 0x3
> [16755112.797214] kvm [2787303]: vcpu17, guest rIP: 0x70a9ae vmx: unexpected exit reason 0x3
> [16755112.797217] kvm [2787303]: vcpu15, guest rIP: 0x70d707 vmx: unexpected exit reason 0x3
> [16755112.797219] kvm [2787303]: vcpu12, guest rIP: 0x701431 vmx: unexpected exit reason 0x3
> [16755112.797221] kvm [2787303]: vcpu7, guest rIP: 0x70b005 vmx: unexpected exit reason 0x3
> [16755112.797222] set kvm_intel.dump_invalid_vmcs=1 to dump internal KVM state.
> [16755112.797224] kvm [2787303]: vcpu4, guest rIP: 0x796fa6 vmx: unexpected exit reason 0x3
> [16755112.797224] set kvm_intel.dump_invalid_vmcs=1 to dump internal KVM state.
> [16755112.797229] kvm [3588862]: vcpu3, guest rIP: 0xffffffff816c7a1b vmx: unexpected exit reason 0x3
> [16755112.797230] set kvm_intel.dump_invalid_vmcs=1 to dump internal KVM state.
> [16755112.797231] set kvm_intel.dump_invalid_vmcs=1 to dump internal KVM state.
> [16755112.797231] set kvm_intel.dump_invalid_vmcs=1 to dump internal KVM state.
> [16755112.797232] set kvm_intel.dump_invalid_vmcs=1 to dump internal KVM state.
> [16755112.797233] set kvm_intel.dump_invalid_vmcs=1 to dump internal KVM state.
> [16755112.797235] kvm [9066]: vcpu5, guest rIP: 0xffffffff8a4a1c0e vmx: unexpected exit reason 0x3
> [16755112.797236] set kvm_intel.dump_invalid_vmcs=1 to dump internal KVM state.
> [16755112.797236] set kvm_intel.dump_invalid_vmcs=1 to dump internal KVM state.
> [16755112.797262] kvm [2813867]: vcpu0, guest rIP: 0xffffffff816c7a1b vmx: unexpected exit reason 0x3
> [16755112.797263] set kvm_intel.dump_invalid_vmcs=1 to dump internal KVM state.
> [18446744004.989880] BUG: unable to handle kernel NULL pointer dereference at 0000000000000008
> [18446744004.989880] PGD 0 P4D 0
> [18446744004.989880] Oops: 0000 [#1] SMP NOPTI
> [18446744004.989880] CPU: 10 PID: 0 Comm: swapper/10 Kdump: loaded Tainted: G           OE    --------- -t - 4.18.0-193.14.2.el8_2.x86_64 #1
> [18446744004.989880] Hardware name: xxxxx, BIOS xx.xx.xxxx 02/18/2020
> [18446744004.989880] RIP: 0010:__list_add_valid+0x0/0x50
> [18446744004.989880] Code: ff ff 49 c7 07 00 00 00 00 41 c7 47 08 00 00 00 00 48 89 44 24 28 e9 dc fe ff ff 48 89 6c 24 28 e9 d2 fe ff ff e8 20 08 c8 ff <48> 8b 42 08 49 89 d0 48 39 f0 0f 85 8c 00 00 00 48 8b 10 4c 39 c2
> 
> Kernel version is: 4.18.0-193.14.2.el8_2.x86_64
> CPU is Intel(R) Xeon(R) Gold 6230N CPU @ 2.30GHz
> 
> When the hosts were found to be stuck, both had been stuck for several days.
> We tried triggering a panic collection of vmcore using sysrq+c magic key,
> but there was no response. Eventually, we had to do a hard reboot by pressing
> the power button to recover.
> 
> There is no crashdump generated.
> 
> Before the two hosts got stuck, they both printed vmx: unexpected exit
> reason 0x3. Looking at the code, we found that exit reason 0x3 is
> EXIT_REASON_INIT_SIGNAL, which means that the current CPU received an INIT IPI
> in non-root mode. But we found that INIT IPIs are only sent during CPU setup.
> Does anyone know why an INIT IPI would be generated?

Software (including BIOS/UEFI/firmware) uses INIT to rendezvous APs with the BSP
when taking control from some other piece of software.  Most commonly, that happens
during CPU setup, but I wouldn't say it's strictly limited to "setup", e.g. it
can come into play when kexec'ing into a new kernel, say after a crash.

Some hardware (chipsets?) will also send INIT in response to a triple fault
shutdown, though I've no idea if that's still true on modern hardware.  E.g. if
the BSP hits a triple fault, it could get hit with an INIT and jump back to the
reset vector and thus BIOS/UEFI, and potentially try to wake APs with INIT, while
the APs are still actively running KVM guests.
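
For reference, here is a minimal sketch of how the raw exit reason in those log
lines maps to a name.  The constant names follow Linux's
arch/x86/include/uapi/asm/vmx.h, but only the first few basic exit reasons are
listed, and decode_exit_reason() is a hypothetical helper for illustration,
not anything KVM provides.

```python
# Basic VM-exit reasons 0-4, named as in Linux's uapi/asm/vmx.h.
# Deliberately incomplete: just enough to decode the value in the logs.
BASIC_EXIT_REASONS = {
    0x0: "EXIT_REASON_EXCEPTION_NMI",
    0x1: "EXIT_REASON_EXTERNAL_INTERRUPT",
    0x2: "EXIT_REASON_TRIPLE_FAULT",
    0x3: "EXIT_REASON_INIT_SIGNAL",
    0x4: "EXIT_REASON_SIPI_SIGNAL",
}


def decode_exit_reason(raw):
    """Hypothetical helper: split a raw exit reason into (name, failed_entry)."""
    basic = raw & 0xFFFF                  # bits 15:0 hold the basic exit reason
    failed_entry = bool(raw & (1 << 31))  # bit 31 flags a VM-entry failure
    name = BASIC_EXIT_REASONS.get(basic, f"unknown (0x{basic:x})")
    return name, failed_entry


print(decode_exit_reason(0x3))
```

Note that reason 0x2 (triple fault) and 0x3 (INIT) are distinct exits; the logs
only show 0x3, i.e. the CPUs received INIT while running guests.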

> HostB printed a NULL pointer BUG, but the panic process did not proceed further
> and instead got stuck. The time 18446744004.989880 is incorrect; the uptime
> of HostB is 193 days.
> 
> We suspect HostB's exception is also related to the previous vmx unexpected
> exit. Has anyone encountered similar cases before? Are there any solutions
> or suggestions?

Odds are very, very good that the unexpected INIT VM-Exit is a symptom, not the
root cause.  The most likely scenario is that the host encountered a fatal error
and either hit shutdown or tried to panic, and that fatal error eventually led
to BIOS or a kdump kernel trying to rendezvous with APs via INIT-SIPI, which in
turn triggered the unexpected VM-Exits.
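
As a rough illustration of why this looks host-wide rather than specific to one
guest, the HostB messages quoted above can be grouped by the bracketed number,
which I'm assuming is the PID of each VM's userspace process.  This is only a
sketch: the regex is written against the message format in this thread, and
summarize() is a made-up helper.

```python
import re
from collections import defaultdict

# Matches the "unexpected exit reason" lines as they appear in the quoted logs.
LINE_RE = re.compile(
    r"\[\s*(?P<ts>\d+\.\d+)\]\s+kvm \[(?P<pid>\d+)\]: vcpu(?P<vcpu>\d+), "
    r"guest rIP: 0x[0-9a-f]+ vmx: unexpected exit reason 0x[0-9a-f]+"
)

# The ten vCPU messages from HostB, copied from the report.
HOSTB_LOG = """\
[16755112.797211] kvm [2787303]: vcpu11, guest rIP: 0x70a8f4 vmx: unexpected exit reason 0x3
[16755112.797213] kvm [2787303]: vcpu16, guest rIP: 0x70a9ae vmx: unexpected exit reason 0x3
[16755112.797214] kvm [2787303]: vcpu17, guest rIP: 0x70a9ae vmx: unexpected exit reason 0x3
[16755112.797217] kvm [2787303]: vcpu15, guest rIP: 0x70d707 vmx: unexpected exit reason 0x3
[16755112.797219] kvm [2787303]: vcpu12, guest rIP: 0x701431 vmx: unexpected exit reason 0x3
[16755112.797221] kvm [2787303]: vcpu7, guest rIP: 0x70b005 vmx: unexpected exit reason 0x3
[16755112.797224] kvm [2787303]: vcpu4, guest rIP: 0x796fa6 vmx: unexpected exit reason 0x3
[16755112.797229] kvm [3588862]: vcpu3, guest rIP: 0xffffffff816c7a1b vmx: unexpected exit reason 0x3
[16755112.797235] kvm [9066]: vcpu5, guest rIP: 0xffffffff8a4a1c0e vmx: unexpected exit reason 0x3
[16755112.797262] kvm [2813867]: vcpu0, guest rIP: 0xffffffff816c7a1b vmx: unexpected exit reason 0x3
"""


def summarize(log_text):
    """Return ({pid: [vcpu ids]}, span of the matching lines in microseconds)."""
    vcpus_by_pid = defaultdict(list)
    stamps_us = []
    for line in log_text.splitlines():
        m = LINE_RE.search(line)
        if not m:
            continue
        vcpus_by_pid[int(m["pid"])].append(int(m["vcpu"]))
        # Keep timestamps as integer microseconds to avoid float rounding.
        sec, usec = m["ts"].split(".")
        stamps_us.append(int(sec) * 1_000_000 + int(usec))
    span_us = max(stamps_us) - min(stamps_us) if stamps_us else 0
    return dict(vcpus_by_pid), span_us


vcpus_by_pid, span_us = summarize(HOSTB_LOG)
print(len(vcpus_by_pid), "processes,", span_us, "us span")
```

Ten vCPUs across four different processes all taking the same exit within a few
tens of microseconds is what you'd expect from something broadcasting INIT to
the CPUs, not from a bug in any single guest.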

But without more information on what the _other_ CPUs were doing, it's practically
impossible to even make a guess as to what went wrong.  And it's even more impossible
since you're running a relatively ancient kernel, which likely has quite a few out
of tree patches (I'm not criticizing running an older kernel, just saying that it
means no one in upstream is likely to have any guesses as to what went wrong).
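
As a side note on the 18446744004.989880 timestamp in the quoted oops: one way
to read it, and this is an assumption on my part rather than anything confirmed
in this thread, is as a small negative nanosecond count printed as an unsigned
64-bit value.  The sketch below just does that arithmetic; as_signed_ns() is a
hypothetical helper, and it says nothing about why the clock would have gone
backwards.

```python
NS_PER_SEC = 10**9


def as_signed_ns(printed):
    """Reinterpret a printed 'seconds.fraction' timestamp as signed 64-bit ns."""
    sec, frac = printed.split(".")
    # Pad the fractional part to nanoseconds (the log prints microseconds).
    ns = int(sec) * NS_PER_SEC + int(frac.ljust(9, "0"))
    if ns >= 2**63:      # top bit set: treat as a two's-complement negative
        ns -= 2**64
    return ns


signed_ns = as_signed_ns("18446744004.989880")
print(signed_ns)                # -68719671616
print(signed_ns / NS_PER_SEC)   # roughly -68.72 seconds
```

Under that assumption the oops was stamped about 68.7 seconds "before zero",
which would fit the clock state itself being corrupted by the time the oops was
printed, while the earlier 16755112.x timestamps (about 193 days) look sane.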