On Fri, Apr 01, 2022, Maciej S. Szmigiero wrote: > On 1.04.2022 20:32, Sean Christopherson wrote: > > On Thu, Mar 10, 2022, Maciej S. Szmigiero wrote: > > > + /* The return address pushed on stack by the CPU for some injected events */ > > > + svm->vmcb->control.next_rip = svm->nested.ctl.next_rip; > > > > This needs to be gated by nrips being enabled _and_ exposed to L1, i.e. > > > > if (svm->nrips_enabled) > > vmcb02->control.next_rip = svm->nested.ctl.next_rip; > > It can be done, however what if we run on a nrips-capable CPU, > but don't expose this capability to the L1? Oh, right, because the field will be populated by the CPU on VM-Exit. Ah, the correct behavior is to grab RIP from vmcb12 to emulate nrips=0 hardware simply not updating RIP. E.g. zeroing it out would send L2 into the weeds on IRET due the CPU pushing '0' on the stack when vectoring the injected event. if (svm->nrips_enabled) vmcb02->control.next_rip = svm->nested.ctl.next_rip; else if (boot_cpu_has(X86_FEATURE_NRIPS)) vmcb02->control.next_rip = vmcb12_rip; > The CPU will then push whatever value was left in this field as > the return address for some L1 injected events. > > Although without nrips feature the L1 shouldn't even attempt event > injection, copying this field anyway will make it work if L1 just > expects this capability based on the current CPU model rather than > by checking specific CPUID feature bits. L1 may still inject the exception, it just advances the RIP manually. As above, the really messy thing is that, because there's no flag to say "don't use NextRIP!", the CPU will still consume NextRIP and push '0' on the stack for the return RIP from the INTn/INT3/INTO. Yay. I found that out the hard way (patch in-progress). The way to handle event injection if KVM is loaded with nrips=0 but nrips is supported in hardware is to stuff NextRIP on event injection even if nrips=0, otherwise the guest is hosed. > > > + u64 next_rip; > > > u64 nested_cr3; > > > u64 virt_ext; > > > u32 clean; > > > > I don't know why this struct has > > > > u8 reserved_sw[32]; > > > > but presumably it's for padding, i.e. probably should be reduced to 24 bytes. > > Apparently the "reserved_sw" field stores Hyper-V enlightenments state - > see commit 66c03a926f18 ("KVM: nSVM: Implement Enlightened MSR-Bitmap feature") > and nested_svm_vmrun_msrpm() in nested.c. Argh, that's a terrible name. Thanks for doing the homework, I was being lazy.