On Tue, 2021-12-07 at 19:30 +0000, Sean Christopherson wrote:
> Synthesize a triple fault if L2 guest state is invalid at the time of
> VM-Enter, which can happen if L1 modifies SMRAM or if userspace stuffs
> guest state via ioctls(), e.g. KVM_SET_SREGS. KVM should never emulate
> invalid guest state, since from L1's perspective, it's architecturally
> impossible for L2 to have invalid state while L2 is running in hardware.
> E.g. attempts to set CR0 or CR4 to unsupported values will either VM-Exit
> or #GP.
>
> Modifying vCPU state via RSM+SMRAM and ioctl() are the only paths that
> can trigger this scenario, as nested VM-Enter correctly rejects any
> attempt to enter L2 with invalid state.
>
> RSM is a straightforward case as (a) KVM follows AMD's SMRAM layout and
> behavior, and (b) Intel's SDM states that loading reserved CR0/CR4 bits
> via RSM results in shutdown, i.e. there is precedent for KVM's behavior.
> Following AMD's SMRAM layout is important as AMD's layout saves/restores
> the descriptor cache information, including CS.RPL and SS.RPL, and also
> defines all the fields relevant to invalid guest state as read-only, i.e.
> so long as the vCPU had valid state before the SMI, which is guaranteed
> for L2, RSM will generate valid state unless SMRAM was modified. Intel's
> layout saves/restores only the selector, which means that scenarios where
> the selector and cached RPL don't match, e.g. conforming code segments,
> would yield invalid guest state. Intel CPUs fudge around this issue by
> stuffing SS.RPL and CS.RPL on RSM. Per Intel's SDM on the "Default
> Treatment of RSM", paraphrasing for brevity:
>
>   IF internal storage indicates that the [CPU was post-VMXON]
>   THEN
>      enter VMX operation (root or non-root);
>      restore VMX-critical state as defined in Section 34.14.1;
>      set to their fixed values any bits in CR0 and CR4 whose values must
>      be fixed in VMX operation [unless coming from an unrestricted guest];
>      IF RFLAGS.VM = 0 AND (in VMX root operation OR the
>      “unrestricted guest” VM-execution control is 0)
>      THEN
>         CS.RPL := SS.DPL;
>         SS.RPL := SS.DPL;
>      FI;
>      restore current VMCS pointer;
>   FI;
>
> Note that Intel CPUs also overwrite the fixed CR0/CR4 bits, whereas KVM
> will synthesize TRIPLE_FAULT in this scenario. KVM's behavior is allowed
> as both Intel and AMD define CR0/CR4 SMRAM fields as read-only, i.e. the
> only way for CR0 and/or CR4 to have illegal values is if they were
> modified by the L1 SMM handler, and Intel's SDM "SMRAM State Save Map"
> section states "modifying these registers will result in unpredictable
> behavior".
>
> KVM's ioctl() behavior is less straightforward. Because KVM allows
> ioctls() to be executed in any order, rejecting an ioctl() if it would
> result in invalid L2 guest state is not an option as KVM cannot know if
> a future ioctl() would resolve the invalid state, e.g. KVM_SET_SREGS, or
> drop the vCPU out of L2, e.g. KVM_SET_NESTED_STATE. Ideally, KVM would
> reject KVM_RUN if L2 contained invalid guest state, but that carries the
> risk of a false positive, e.g. if RSM loaded invalid guest state and KVM
> exited to userspace. Setting a flag/request to detect such a scenario is
> undesirable because (a) it's extremely unlikely to add value to KVM as a
> whole, and (b) KVM would need to consider ioctl() interactions with such
> a flag, e.g. if userspace migrated the vCPU while the flag were set.
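
The SDM's RPL stuffing is easier to follow when written out as C. Below
is a purely illustrative sketch; the context struct and helper names are
invented for this example and correspond to neither KVM code nor a real
hardware interface:

  #include <stdbool.h>
  #include <stdint.h>

  /* Hypothetical slice of the segment state that RSM operates on. */
  struct seg_state {
          uint16_t selector;
          uint8_t  rpl;   /* cached RPL, can diverge from selector & 3 */
          uint8_t  dpl;
  };

  struct rsm_ctx {
          bool post_vmxon;        /* CPU was in VMX operation pre-SMI */
          bool vmx_root;          /* VMX root vs. non-root operation */
          bool unrestricted;      /* "unrestricted guest" control set */
          bool rflags_vm;         /* RFLAGS.VM, i.e. virtual-8086 mode */
          struct seg_state cs, ss;
  };

  /* Mirrors the quoted "Default Treatment of RSM" pseudocode. */
  static void rsm_stuff_rpl(struct rsm_ctx *ctx)
  {
          if (!ctx->post_vmxon)
                  return;

          if (!ctx->rflags_vm && (ctx->vmx_root || !ctx->unrestricted)) {
                  /*
                   * Stuff both RPLs from SS.DPL so the cached RPL can't
                   * disagree with the restored selector, e.g. for
                   * conforming code segments, where the selector RPL and
                   * the cached RPL legitimately differ.
                   */
                  ctx->cs.rpl = ctx->ss.dpl;
                  ctx->ss.rpl = ctx->ss.dpl;
          }
  }

KVM sidesteps the need for this fixup by following AMD's layout, which
saves and restores the cached RPL directly, as the changelog notes.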
>
> Cc: stable@xxxxxxxxxxxxxxx
> Signed-off-by: Sean Christopherson <seanjc@xxxxxxxxxx>
> ---
>  arch/x86/kvm/vmx/vmx.c | 32 ++++++++++++++++++++++++--------
>  1 file changed, 24 insertions(+), 8 deletions(-)
>
> diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> index 9e415e5a91ab..5307fcee3b3b 100644
> --- a/arch/x86/kvm/vmx/vmx.c
> +++ b/arch/x86/kvm/vmx/vmx.c
> @@ -5900,18 +5900,14 @@ static int __vmx_handle_exit(struct kvm_vcpu *vcpu, fastpath_t exit_fastpath)
>  		vmx_flush_pml_buffer(vcpu);
>
>  	/*
> -	 * We should never reach this point with a pending nested VM-Enter, and
> -	 * more specifically emulation of L2 due to invalid guest state (see
> -	 * below) should never happen as that means we incorrectly allowed a
> -	 * nested VM-Enter with an invalid vmcs12.
> +	 * KVM should never reach this point with a pending nested VM-Enter.
> +	 * More specifically, short-circuiting VM-Entry to emulate L2 due to
> +	 * invalid guest state should never happen as that means KVM knowingly
> +	 * allowed a nested VM-Enter with an invalid vmcs12. More below.
>  	 */
>  	if (KVM_BUG_ON(vmx->nested.nested_run_pending, vcpu->kvm))
>  		return -EIO;
>
> -	/* If guest state is invalid, start emulating */
> -	if (vmx->emulation_required)
> -		return handle_invalid_guest_state(vcpu);
> -
>  	if (is_guest_mode(vcpu)) {
>  		/*
>  		 * PML is never enabled when running L2, bail immediately if a
> @@ -5933,10 +5929,30 @@ static int __vmx_handle_exit(struct kvm_vcpu *vcpu, fastpath_t exit_fastpath)
>  		 */
>  		nested_mark_vmcs12_pages_dirty(vcpu);
>
> +		/*
> +		 * Synthesize a triple fault if L2 state is invalid. In normal
> +		 * operation, nested VM-Enter rejects any attempt to enter L2
> +		 * with invalid state. However, those checks are skipped if
> +		 * state is being stuffed via RSM or KVM_SET_NESTED_STATE. If
> +		 * L2 state is invalid, it means either L1 modified SMRAM state
> +		 * or userspace provided bad state. Synthesize TRIPLE_FAULT as
> +		 * doing so is architecturally allowed in the RSM case, and is
> +		 * the least awful solution for the userspace case without
> +		 * risking false positives.
> +		 */
> +		if (vmx->emulation_required) {
> +			nested_vmx_vmexit(vcpu, EXIT_REASON_TRIPLE_FAULT, 0, 0);
> +			return 1;
> +		}
> +
>  		if (nested_vmx_reflect_vmexit(vcpu))
>  			return 1;
>  	}
>
> +	/* If guest state is invalid, start emulating. L2 is handled above. */
> +	if (vmx->emulation_required)
> +		return handle_invalid_guest_state(vcpu);
> +
>  	if (exit_reason.failed_vmentry) {
>  		dump_vmcs(vcpu);
>  		vcpu->run->exit_reason = KVM_EXIT_FAIL_ENTRY;

Reviewed-by: Maxim Levitsky <mlevitsk@xxxxxxxxxx>

Best regards,
	Maxim Levitsky
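
As a concrete illustration of the "userspace stuffs bad state" trigger
described in the changelog, something along these lines would do it from
userspace. This is an untested sketch, not a reproducer: creating the
VM/vCPU and getting the vCPU into L2 (where the patched path synthesizes
TRIPLE_FAULT on the next KVM_RUN) are elided, and vcpu_fd is assumed to
reference a vCPU that is currently in L2:

  #include <linux/kvm.h>
  #include <sys/ioctl.h>

  static int stuff_invalid_ss(int vcpu_fd)
  {
          struct kvm_sregs sregs;

          if (ioctl(vcpu_fd, KVM_GET_SREGS, &sregs))
                  return -1;

          /*
           * Force SS.RPL != SS.DPL, violating one of the VMX guest-state
           * consistency requirements. KVM can't reject this outright,
           * because a later ioctl() might fix the state (KVM_SET_SREGS)
           * or pull the vCPU out of L2 (KVM_SET_NESTED_STATE).
           */
          sregs.ss.selector = (sregs.ss.selector & ~3) |
                              ((sregs.ss.dpl ^ 1) & 3);

          return ioctl(vcpu_fd, KVM_SET_SREGS, &sregs);
  }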