On Wed, 2024-04-17 at 09:11 -0700, Sean Christopherson wrote: > On Wed, Apr 17, 2024, Thomas Prescher wrote: > > On Tue, 2024-04-16 at 11:07 -0700, Sean Christopherson wrote: > > > Hur dur, I forgot that KVM provides a "guest_mode" stat. > > > Userspace can do > > > KVM_GET_STATS_FD on the vCPU FD to get a file handle to the > > > binary stats, > > > and then you wouldn't need to call back into KVM just to query > > > guest_mode. > > > > > > Ah, and I also forgot that we have kvm_run.flags, so adding > > > KVM_RUN_X86_GUEST_MODE would also be trivial (I almost suggested > > > it > > > earlier, but didn't want to add a new field to kvm_run without a > > > very good > > > reason). > > > > Thanks for the pointers. This is really helpful. > > > > I tried the "guest_mode" stat as you suggested and it solves the > > immediate issue we have with VirtualBox/KVM. > > Note, > > > What I don't understand is that we do not get the effective CR4 > > value > > of the L2 guest in kvm_run.s.regs.sregs.cr4. > > Because what you're asking for is *not* the effective CR4 value of > L2. > > E.g. if L1 is using legacy shadowing paging to run L2, L1 is likely > going to run > L2 with GUEST_CR0.PG=1, GUEST_CR4.PAE=1, and GUEST_CR4.PSE=0 (though > PSE is largely > irrelevant), i.e. will either use PAE paging or 64-bit paging to > shadow L2. > > But L2 itself could be unpaged (CR0.PG=0, CR4.PAE=x, CR4.PSE=x), > using 32-bit > paging (CR0.PG=1, CR4.PAE=0, CR4.PSE=0), or using 32-bit paging with > 4MiB hugepages > (CR0.PG=1, CR4.PAE=0, CR4.PSE=1). In all of those cases, the > effective CR0 and CR4 > values consumed by hardware are CR0.PG=1, CR4.PAE=1, and CR4.PSE. > > Or to convolute things even further, if L0 is running L1 with > shadowing paging, > and L1 is running L2 with shadow paging but doing something weird and > using PSE > paging, then it would be possible to end up with: > > vmcs12->guest_cr4: > .pae = 0 > .pse = 1 > > vmcs12->cr4_read_shadow: > .pae = 0 > .pse = 0 > > vmcs02->guest_cr4: > .pae = 1 > .pse = 0 > > > Instead, userland sees the contents of Vmcs::GUEST_CR4. Shouldn't > > this be the > > combination of GUEST_CR4, GUEST_CR4_MASK and CR4_READ_SHADOW, i.e. > > what L2 > > actually sees as CR4 value? > > Because KVM_{G,S}ET_SREGS (and all other uAPIs in that vein) are > defined to operate > on actual vCPU state, and having them do something different if the > vCPU is in guest > mode would confusing/odd, and nonsensical to differences between VMX > and SVM. > > SVM doesn't have per-bit CR0/CR4 controls, i.e. CR4 loads and stores > need to be > intercepted, and so having KVM_{G,S}ET_SREGS operate on > CR4_READ_SHADOW for VMX > would yield different ABI for VMX versus SVM. > > Note, what L2 *sees* is not a combination of the above; what L2 sees > is purely > CR4_READ_SHADOW. The other fields are consulted only if L2 attempts > to load CR4. > > > If this is expected, can you please explain the reasoning behind > > this > > interface decision? For me, it does not make sense that writing > > back > > the same value we receive at exit time causes a change in what L2 > > sees > > for CR4. > > I doubt there was ever a concious decision, rather it never came up > and thus the > code is the result of doing nothing when nested VMX support was > added. > > That said, KVM's behavior is probably the least awful choice. The > changelog of > the proposed patch is wrong when it says: > > If the signal is meant to be delivered to the L0 VMM, and L0 > updates CR4 for L1 > > because the update isn't for L1, it's for the active vCPU state, > which is L2. > > At first glance, skipping the vmcs02.CR4_READ_SHADOW seems to make > sense, but it > would create a bizarre inconsistency as KVM_SET_SREGS would > effectively override > vmcs12->guest_cr4, but not vmcs12->cr4_read_shadow. KVM doesn't know > the intent > of userspace, i.e. KVM can't know if userspace wants to change just > the effective > value for CR4, or if userspace wants to change the effective *and* > observable > value for CR4. > > In your case, where writing CR4 is spurious, preserving the read > shadow works, > but if there were some use case where userspace actually wanted to > change L2's > CR4, leaving the read shadow set to vmcs12 would be wrong. > > The whole situation is rather nonsensical, because if userspace > actually did change > CR4, the changes would be lost on the next nested VM-Exit => VM- > Entry. That could > be solved by writing to vmcs12, but that creates a headache of its > own because then > userspace changes to L2 become visible to L1, without userspace > explicitly requesting > that. > > Unfortunately, simply disallowing state save/restore when L2 is > active doesn't > work either, because userspace needs to be able to save/restore state > that _isn't_ > context switched by hardware, i.e. isn't in the VMCS or VMCB. > > In short, yes, it's goofy and annoying, but there's no great solution > and the > issue really does need to be solved/avoided in userspace > > > Another question is: when we want to save the VM state during a > > savevm/loadvm cycle, we kick all vCPUs via a singal and save their > > state. If any vCPU runs in L2 at the time of the exit, we somehow > > need > > to let it continue to run until we get an exit with the L1 state. > > Is > > there a mechanism to help us here? > > Hmm, no? What is it you're trying to do, i.e. why are you doing > save/load? If > you really want to save/load _all_ state, the right thing to do is to > also save > and load nested state. > You are right. After your pointers and looking at the nesting code again, I think I know what to do. Just to make sure I understand this correctly: If L0 exits with L2 state, KVM_GET_NESTED_STATE will have KVM_STATE_NESTED_RUN_PENDING set in the flags field. So when we restore the vCPU state after a vmsave/vmload cycle, we don't need to update anything in kvm_run.s.regs because KVM will enter the L2 immediately. Is that correct?