Sean Christopherson <seanjc@xxxxxxxxxx> writes: > On Wed, Sep 15, 2021, Vitaly Kuznetsov wrote: >> Sean Christopherson <seanjc@xxxxxxxxxx> writes: >> > +static void __vmx_vcpu_reset(struct kvm_vcpu *vcpu) >> > +{ >> > + struct vcpu_vmx *vmx = to_vmx(vcpu); >> > + >> > + init_vmcs(vmx); >> > + >> > + if (nested) >> > + memcpy(&vmx->nested.msrs, &vmcs_config.nested, sizeof(vmx->nested.msrs)); >> > + >> > + vcpu_setup_sgx_lepubkeyhash(vcpu); >> > + >> > + vmx->nested.posted_intr_nv = -1; >> > + vmx->nested.current_vmptr = -1ull; >> > + vmx->nested.hv_evmcs_vmptr = EVMPTR_INVALID; >> >> What would happen in (hypothetical) case when enlightened VMCS is >> currently in use? If we zap 'hv_evmcs_vmptr' here, the consequent >> nested_release_evmcs() (called from >> nested_vmx_handle_enlightened_vmptrld(), for example) will not do >> kvm_vcpu_unmap() while it should. > > The short answer is that there's a lot of stuff that needs to be addressed before > KVM can expose a RESET ioctl(). My goal with these patches is to carve out the > stubs and move the few bits of RESET emulation into the "stubs". This is the same > answer for the MSR question/comment at the end. > >> This, however, got me thinking: should we free all-things-nested with >> free_nested()/nested_vmx_free_vcpu() upon vcpu reset? I can't seem to >> find us doing that... (I do remember that INIT is blocked in VMX-root >> mode and nobody else besides kvm_arch_vcpu_create()/ >> kvm_apic_accept_events() seems to call kvm_vcpu_reset()) but maybe we >> should at least add a WARN_ON() guardian here? > > I think that makes sense. Maybe use CR0 as a sentinel since it has a non-zero > RESET value? E.g. WARN if CR0 is non-zero at RESET. > > diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c > index 86539c1686fa..3ac074376821 100644 > --- a/arch/x86/kvm/x86.c > +++ b/arch/x86/kvm/x86.c > @@ -10813,6 +10813,11 @@ void kvm_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event) > unsigned long new_cr0; > u32 eax, dummy; > > + /* > + * <comment about KVM not supporting arbitrary RESET> > + */ > + WARN_ON_ONCE(!init_event && old_cr0); > + > kvm_lapic_reset(vcpu, init_event); > > vcpu->arch.hflags = 0; > Should work (assuming nothing in VMX would want to call vmx_vcpu_reset()/ __vmx_vcpu_reset() directly). > Huh, typing that out made me realize commit 0aa1837533e5 ("KVM: x86: Properly > reset MMU context at vCPU RESET/INIT") technically introduced a bug. kvm_vcpu_reset() > does kvm_read_cr0() and thus reads vmcs.GUEST_CR0 because vcpu->arch.regs_avail is > (correctly) not stuffed to ALL_ONES until later in kvm_vcpu_reset(). init_vmcs() > doesn't explicitly zero vmcs.GUEST_CR0 (along with many other guest fields), and > so VMREAD(GUEST_CR0) is technically consuming garbage. In practice, it's consuming > '0' because no known CPU or VMM inverts values in the VMCS, i.e. zero allocating > the VMCS is functionally equivalent to writing '0' to all fields via VMWRITE. > > And staring more at kvm_vcpu_reset(), this code is terrifying for INIT > > memset(vcpu->arch.regs, 0, sizeof(vcpu->arch.regs)); > vcpu->arch.regs_avail = ~0; > vcpu->arch.regs_dirty = ~0; > > because it means cr0 and cr4 are marked available+dirty without immediately writing > vcpu->arch.cr0/cr4. And VMX subtly relies on that, as vmx_set_cr0() grabs CR0.PG > via kvm_read_cr0_bits(), i.e. zeroing vcpu->arch.cr0 would "break" the INIT flow. > Ignoring for the moment that CR0.PG is never guest-owned and thus never stale in > vcpu->arch.cr0, KVM is also technically relying on the earlier kvm_read_cr0() in > kvm_vcpu_reset() to ensure vcpu->arch.cr0 is fresh. > > Stuffing regs_avail technically means vmx_set_rflags() -> vmx_get_rflags() is > consuming stale data. It doesn't matter in practice because the old value is > only used to re-evaluate vmx->emulation_required, which is guaranteed to be up > refreshed by vmx_set_cr0() and friends. PDPTRs and EXIT_INFO are in a similar > boat; KVM shouldn't be reading those fields (CR0.PG=0, not an exit path), but > marking them dirty without actually updating the cached values is wrong. > > There's also one concrete bug: KVM doesn't set vcpu->arch.cr3=0 on RESET/INIT. > That bug has gone unnoticed because no real word BIOS/kernel is going to rely on > INIT to set CR3=0. > > I'm strongly leaning towards stuffing regs_avail/dirty in kvm_arch_vcpu_create(), > and relying on explicit kvm_register_mark_dirty() calls for the 3 or so cases where > x86 code writes a vcpu->arch register directly. That would fix the CR0 read bug > and also prevent subtle bugs from sneaking in. Adding new EXREGS would be slightly > more costly, but IMO that's a good thing as would force us to actually think about > how to handle each register. > > E.g. implement this over 2-3 patches: > > diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c > index 114847253e0a..743146ac8307 100644 > --- a/arch/x86/kvm/vmx/vmx.c > +++ b/arch/x86/kvm/vmx/vmx.c > @@ -4385,6 +4385,7 @@ static void vmx_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event) > kvm_set_cr8(vcpu, 0); > > vmx_segment_cache_clear(vmx); > + kvm_register_mark_available(vcpu, VCPU_EXREG_SEGMENTS); > > seg_setup(VCPU_SREG_CS); > vmcs_write16(GUEST_CS_SELECTOR, 0xf000); > diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c > index 86539c1686fa..ab907a0b9eeb 100644 > --- a/arch/x86/kvm/x86.c > +++ b/arch/x86/kvm/x86.c > @@ -10656,6 +10656,8 @@ int kvm_arch_vcpu_create(struct kvm_vcpu *vcpu) > int r; > > vcpu->arch.last_vmentry_cpu = -1; > + vcpu->arch.regs_avail = ~0; > + vcpu->arch.regs_dirty = ~0; > > if (!irqchip_in_kernel(vcpu->kvm) || kvm_vcpu_is_reset_bsp(vcpu)) > vcpu->arch.mp_state = KVM_MP_STATE_RUNNABLE; > @@ -10874,9 +10876,9 @@ void kvm_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event) > vcpu->arch.xcr0 = XFEATURE_MASK_FP; > } > > + /* All GPRs except RDX (handled below) are zeroed on RESET/INIT. */ > memset(vcpu->arch.regs, 0, sizeof(vcpu->arch.regs)); > - vcpu->arch.regs_avail = ~0; > - vcpu->arch.regs_dirty = ~0; > + kvm_register_mark_dirty(vcpu, VCPU_REGS_RSP); > > /* > * Fall back to KVM's default Family/Model/Stepping of 0x600 (P6/Athlon) > @@ -10897,6 +10899,9 @@ void kvm_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event) > kvm_set_rflags(vcpu, X86_EFLAGS_FIXED); > kvm_rip_write(vcpu, 0xfff0); > > + vcpu->arch.cr3 = 0; > + kvm_register_mark_dirty(vcpu, VCPU_EXREG_CR3); > + > /* > * CR0.CD/NW are set on RESET, preserved on INIT. Note, some versions > * of Intel's SDM list CD/NW as being set on INIT, but they contradict > A selftest for vCPU create/reset would be really helpful. I can even volunteer to [eventually] write one :-) -- Vitaly