On Wed, Apr 17, 2024, Thomas Prescher wrote: > On Tue, 2024-04-16 at 11:07 -0700, Sean Christopherson wrote: > > Hur dur, I forgot that KVM provides a "guest_mode" stat. Userspace can do > > KVM_GET_STATS_FD on the vCPU FD to get a file handle to the binary stats, > > and then you wouldn't need to call back into KVM just to query guest_mode. > > > > Ah, and I also forgot that we have kvm_run.flags, so adding > > KVM_RUN_X86_GUEST_MODE would also be trivial (I almost suggested it > > earlier, but didn't want to add a new field to kvm_run without a very good > > reason). > > Thanks for the pointers. This is really helpful. > > I tried the "guest_mode" stat as you suggested and it solves the > immediate issue we have with VirtualBox/KVM. Note, > What I don't understand is that we do not get the effective CR4 value > of the L2 guest in kvm_run.s.regs.sregs.cr4. Because what you're asking for is *not* the effective CR4 value of L2. E.g. if L1 is using legacy shadowing paging to run L2, L1 is likely going to run L2 with GUEST_CR0.PG=1, GUEST_CR4.PAE=1, and GUEST_CR4.PSE=0 (though PSE is largely irrelevant), i.e. will either use PAE paging or 64-bit paging to shadow L2. But L2 itself could be unpaged (CR0.PG=0, CR4.PAE=x, CR4.PSE=x), using 32-bit paging (CR0.PG=1, CR4.PAE=0, CR4.PSE=0), or using 32-bit paging with 4MiB hugepages (CR0.PG=1, CR4.PAE=0, CR4.PSE=1). In all of those cases, the effective CR0 and CR4 values consumed by hardware are CR0.PG=1, CR4.PAE=1, and CR4.PSE. Or to convolute things even further, if L0 is running L1 with shadowing paging, and L1 is running L2 with shadow paging but doing something weird and using PSE paging, then it would be possible to end up with: vmcs12->guest_cr4: .pae = 0 .pse = 1 vmcs12->cr4_read_shadow: .pae = 0 .pse = 0 vmcs02->guest_cr4: .pae = 1 .pse = 0 > Instead, userland sees the contents of Vmcs::GUEST_CR4. Shouldn't this be the > combination of GUEST_CR4, GUEST_CR4_MASK and CR4_READ_SHADOW, i.e. what L2 > actually sees as CR4 value? Because KVM_{G,S}ET_SREGS (and all other uAPIs in that vein) are defined to operate on actual vCPU state, and having them do something different if the vCPU is in guest mode would confusing/odd, and nonsensical to differences between VMX and SVM. SVM doesn't have per-bit CR0/CR4 controls, i.e. CR4 loads and stores need to be intercepted, and so having KVM_{G,S}ET_SREGS operate on CR4_READ_SHADOW for VMX would yield different ABI for VMX versus SVM. Note, what L2 *sees* is not a combination of the above; what L2 sees is purely CR4_READ_SHADOW. The other fields are consulted only if L2 attempts to load CR4. > If this is expected, can you please explain the reasoning behind this > interface decision? For me, it does not make sense that writing back > the same value we receive at exit time causes a change in what L2 sees > for CR4. I doubt there was ever a concious decision, rather it never came up and thus the code is the result of doing nothing when nested VMX support was added. That said, KVM's behavior is probably the least awful choice. The changelog of the proposed patch is wrong when it says: If the signal is meant to be delivered to the L0 VMM, and L0 updates CR4 for L1 because the update isn't for L1, it's for the active vCPU state, which is L2. At first glance, skipping the vmcs02.CR4_READ_SHADOW seems to make sense, but it would create a bizarre inconsistency as KVM_SET_SREGS would effectively override vmcs12->guest_cr4, but not vmcs12->cr4_read_shadow. KVM doesn't know the intent of userspace, i.e. KVM can't know if userspace wants to change just the effective value for CR4, or if userspace wants to change the effective *and* observable value for CR4. In your case, where writing CR4 is spurious, preserving the read shadow works, but if there were some use case where userspace actually wanted to change L2's CR4, leaving the read shadow set to vmcs12 would be wrong. The whole situation is rather nonsensical, because if userspace actually did change CR4, the changes would be lost on the next nested VM-Exit => VM-Entry. That could be solved by writing to vmcs12, but that creates a headache of its own because then userspace changes to L2 become visible to L1, without userspace explicitly requesting that. Unfortunately, simply disallowing state save/restore when L2 is active doesn't work either, because userspace needs to be able to save/restore state that _isn't_ context switched by hardware, i.e. isn't in the VMCS or VMCB. In short, yes, it's goofy and annoying, but there's no great solution and the issue really does need to be solved/avoided in userspace > Another question is: when we want to save the VM state during a > savevm/loadvm cycle, we kick all vCPUs via a singal and save their > state. If any vCPU runs in L2 at the time of the exit, we somehow need > to let it continue to run until we get an exit with the L1 state. Is > there a mechanism to help us here? Hmm, no? What is it you're trying to do, i.e. why are you doing save/load? If you really want to save/load _all_ state, the right thing to do is to also save and load nested state.