On Fri, Feb 07, 2025, Doug Covelli wrote:
> To test support for nested virtualization I was running a VM (L2) on a
> debug build of ESX (L1) on VMware Workstation/KVM (L0). This
> consistently resulted in an ASSERT in L1 firing as the interrupt
> shadow bit in the VMCB was set on an #NPF exit that occurred when
> vectoring through the IDT to deliver an interrupt to L2.
>
> Some details from our exit recorder are below. Basically what
> happened is that L1 resumed L2 after handling an I/O exit and
> attempted to inject an internal interrupt with vector 0x68. This
> resulted in a #NPF exit when vectoring through the IDT to deliver the
> interrupt to the guest with the interrupt shadow bit set which our
> code is not expecting. There is no reason for the interrupt shadow
> bit to be set and neither L1 or L0 were setting it.
>
> This turns out to be due to a quirk where on AMD 'vmrun' after an
> 'sti' will cause the interrupt shadow bit to leak into the guest state
> in the VMCB. Jim Mattson discovered this back when he was with VMware
> and checked in a fix to make sure that our 'vmrun' is not immediately
> after an 'sti':
>
> sti                /* Enable interrupts during guest execution */
> mov svmPhysCurrentVMCB(%rip), %rax
> vmrun              /* Must not immediately follow STI. See PR 150935 */
>
> PR 150935 describes exactly the same problem I am seeing with KVM.
> For KVM the 'vmrun' is immediately after a 'sti' though:
>
> /* Enter guest mode */
> sti
>
> 1: vmrun %rax
>
> I confirmed that moving the 'sti' after the mov instruction in the
> VMware code causes the same exact ASSERT to fire. I discussed this
> with Jim and Sean and they suggested sending an e-mail to this list.
> Jim also mentioned that this was introduced by [1] a few years back.
> It would be hard to argue that this isn't an AMD bug but it seems best
> to workaround it in SW. It would be great if someone could fix this
> but if folks are too busy I can ask Zach to include it in the patches
> he is working on.
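
For anyone following along, the quirk as described above can be sketched as a toy state machine. This is purely illustrative and is not KVM, ESX, or hardware-accurate code: every name (toy_cpu, toy_step, toy_run, TOY_INT_SHADOW) is hypothetical, and the model assumes only what the report says, i.e. that STI shadows exactly the next instruction and that a VMRUN executed in that shadow leaves the interrupt shadow bit set in the guest state.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical toy model; none of these names exist in KVM or ESX. */
#define TOY_INT_SHADOW 0x1u	/* stands in for the VMCB interrupt shadow bit */

enum toy_insn { TOY_STI, TOY_MOV, TOY_VMRUN };

struct toy_cpu {
	bool irqs_enabled;		/* models RFLAGS.IF */
	bool sti_shadow;		/* true while the insn after STI executes */
	uint32_t vmcb_int_state;	/* models the guest interrupt state field */
};

/* Execute one instruction of the modeled host entry sequence. */
static void toy_step(struct toy_cpu *cpu, enum toy_insn insn)
{
	bool in_shadow = cpu->sti_shadow;

	/* Assumption: the shadow covers exactly one instruction. */
	cpu->sti_shadow = false;

	switch (insn) {
	case TOY_STI:
		cpu->irqs_enabled = true;
		cpu->sti_shadow = true;
		break;
	case TOY_MOV:
		/* Any instruction between STI and VMRUN consumes the shadow. */
		break;
	case TOY_VMRUN:
		/* The reported quirk: a pending STI shadow leaks into guest state. */
		if (in_shadow)
			cpu->vmcb_int_state |= TOY_INT_SHADOW;
		break;
	}
}

/* Run a sequence from a clean state and return the modeled int_state. */
static uint32_t toy_run(const enum toy_insn *seq, int n)
{
	struct toy_cpu cpu = { 0 };
	int i;

	assert(seq && n >= 0);
	for (i = 0; i < n; i++)
		toy_step(&cpu, seq[i]);
	return cpu.vmcb_int_state;
}
```

In this model, { TOY_STI, TOY_VMRUN } (the KVM ordering) returns TOY_INT_SHADOW, while { TOY_STI, TOY_MOV, TOY_VMRUN } (the VMware ordering) returns 0.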

I'll post a patch and a regression test. It took me ~15 minutes to realize the
key is taking an exit while injecting an event, i.e. before executing anything in
the guest. ~3 minutes to re-learn nested_exceptions_test.c, and 2 seconds to add
a testcase:

diff --git a/tools/testing/selftests/kvm/x86/nested_exceptions_test.c b/tools/testing/selftests/kvm/x86/nested_exceptions_test.c
index 3eb0313ffa39..3641a42934ac 100644
--- a/tools/testing/selftests/kvm/x86/nested_exceptions_test.c
+++ b/tools/testing/selftests/kvm/x86/nested_exceptions_test.c
@@ -85,6 +85,7 @@ static void svm_run_l2(struct svm_test_data *svm, void *l2_code, int vector,
 
 	GUEST_ASSERT_EQ(ctrl->exit_code, (SVM_EXIT_EXCP_BASE + vector));
 	GUEST_ASSERT_EQ(ctrl->exit_info_1, error_code);
+	GUEST_ASSERT(!ctrl->int_state);
 }
 
 static void l1_svm_code(struct svm_test_data *svm)
@@ -122,6 +123,7 @@ static void vmx_run_l2(void *l2_code, int vector, uint32_t error_code)
 	GUEST_ASSERT_EQ(vmreadz(VM_EXIT_REASON), EXIT_REASON_EXCEPTION_NMI);
 	GUEST_ASSERT_EQ((vmreadz(VM_EXIT_INTR_INFO) & 0xff), vector);
 	GUEST_ASSERT_EQ(vmreadz(VM_EXIT_INTR_ERROR_CODE), error_code);
+	GUEST_ASSERT(!vmreadz(GUEST_INTERRUPTIBILITY_INFO));
 }
 
 static void l1_vmx_code(struct vmx_pages *vmx)