Query the vCPU's APICv state, not the overall VM's state, when handling
a page fault that hit the APIC Access Page memslot. While the vCPU
state can be stale with respect to the overall VM APICv state, it is at
least coherent with respect to the page fault being handled, i.e. it
reflects the vCPU's APICv enabling at the time of the fault. Using the
VM's state is not functionally broken, but neither does it provide any
meaningful advantage, and arguably retrying the faulting instruction,
as will happen if the vCPU state is stale (see below), is semantically
the desired behavior as it aligns with existing KVM MMU behavior, where
a change in VM state invalidates the context for handling a page fault.

The vCPU state can be stale if a different vCPU (or other actor)
toggles APICv state after the page fault exit from the guest, in which
case KVM will attempt to handle the page fault before servicing the
pending KVM_REQ_APICV_UPDATE. However, by design such a race is benign
in all scenarios.

If APICv is globally enabled but locally disabled, the page fault
handler will go straight to emulation (emulating the local APIC is ok
even if APIC virtualization is enabled while the guest is running).

If APICv is globally disabled but locally enabled, there are two
possible scenarios, and in each scenario the final outcome is correct:
KVM blocks all vCPUs until the SPTEs for the APIC Access Page have been
zapped.

 (1) The page fault acquires mmu_lock before the APICv update calls
     kvm_zap_gfn_range(). The page fault will install a "bad" SPTE, but
     that SPTE will never be consumed as (a) no vCPUs can be running in
     the guest due to the kick from KVM_REQ_APICV_UPDATE, and (b) the
     "bad" SPTE is guaranteed to be zapped by kvm_zap_gfn_range()
     before any vCPU re-enters the guest. Because KVM_REQ_APICV_UPDATE
     is raised before the VM state is updated, all vCPUs will block on
     apicv_update_lock when attempting to service the request until the
     lock is dropped by the update flow, _after_ the SPTE is zapped.

 (2) The APICv update's kvm_zap_gfn_range() acquires mmu_lock before
     the page fault. The page fault handler will bail due to the change
     in mmu_notifier_seq made by kvm_zap_gfn_range().

Arguably, KVM should not (attempt to) install a SPTE based on state
that is knowingly outdated, but using the per-VM state suffers a
similar flaw, as not checking for a pending KVM_REQ_APICV_UPDATE means
that KVM will either knowingly install a SPTE that will get zapped
(page fault wins the race) or have the installation rejected
(kvm_zap_gfn_range() wins the race). However, adding a check for
KVM_REQ_APICV_UPDATE is extremely undesirable as it implies the check
is meaningful/sufficient, which is not the case since the page fault
handler doesn't hold mmu_lock when it checks APICv status, i.e. the
mmu_notifier_seq mechanism is required for correctness.

Cc: Maxim Levitsky <mlevitsk@xxxxxxxxxx>
Signed-off-by: Sean Christopherson <seanjc@xxxxxxxxxx>
---
 arch/x86/kvm/mmu/mmu.c | 29 +++++++++++++++++++++++++++--
 1 file changed, 27 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index f9f228963088..6530d874f5b7 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -3911,10 +3911,35 @@ static bool kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault,
 		 * If the APIC access page exists but is disabled, go directly
 		 * to emulation without caching the MMIO access or creating a
 		 * MMIO SPTE. That way the cache doesn't need to be purged
-		 * when the AVIC is re-enabled.
+		 * when the AVIC is re-enabled. Note, the vCPU's APICv state
+		 * may be stale if an APICv update is in progress, but KVM is
+		 * guaranteed to operate correctly in all scenarios:
+		 *
+		 * 1. APICv is globally enabled but locally disabled. This
+		 *    vCPU will go straight to emulation without touching the
+		 *    MMIO cache or installing a SPTE.
+		 *
+		 * 2a. APICv is globally disabled but locally enabled, and this
+		 *     vCPU acquires mmu_lock before the APICv update calls
+		 *     kvm_zap_gfn_range(). This vCPU will install a "bad" SPTE,
+		 *     but the SPTE will never be consumed as (a) no vCPUs can
+		 *     be running due to the kick from KVM_REQ_APICV_UPDATE,
+		 *     and (b) the "bad" SPTE is guaranteed to be zapped by
+		 *     kvm_zap_gfn_range() before any vCPU re-enters the guest.
+		 *     Because KVM_REQ_APICV_UPDATE is raised before the VM
+		 *     state is updated, vCPUs will block on apicv_update_lock
+		 *     when attempting to service the request until the lock is
+		 *     dropped by the update flow, _after_ the SPTE is zapped.
+		 *
+		 * 2b. APICv is globally disabled but locally enabled, and the
+		 *     APICv update's kvm_zap_gfn_range() acquires mmu_lock before
+		 *     this vCPU. This vCPU will bail from the page fault
+		 *     handler due to kvm_zap_gfn_range() bumping mmu_notifier_seq,
+		 *     and will retry the faulting instruction after servicing
+		 *     the KVM_REQ_APICV_UPDATE request.
 		 */
 		if (slot && slot->id == APIC_ACCESS_PAGE_PRIVATE_MEMSLOT &&
-		    !kvm_apicv_activated(vcpu->kvm)) {
+		    !kvm_vcpu_apicv_active(vcpu)) {
 			*r = RET_PF_EMULATE;
 			return true;
 		}
-- 
2.33.0.1079.g6e70778dc9-goog
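
As an aside, here is a minimal standalone sketch of the snapshot-and-recheck
pattern that makes scenario 2b benign. This is not KVM code: the names
mmu_seq, spte, zap_range(), fault_begin() and fault_finish() are invented for
illustration, and a pthread mutex stands in for mmu_lock. The idea being
modeled is that the fault path snapshots a sequence count before doing its
lockless work, then re-checks it under the lock and bails (forcing a retry)
if a zap bumped the count in between.

/*
 * Standalone model of the snapshot-and-recheck pattern (mmu_notifier_seq
 * in KVM proper). All names below are illustrative only.
 */
#include <pthread.h>
#include <stdbool.h>
#include <stdio.h>

static pthread_mutex_t mmu_lock = PTHREAD_MUTEX_INITIALIZER;
static unsigned long mmu_seq;	/* bumped by every zap, read locklessly */
static long spte = -1;		/* -1 == no translation installed */

/* Stand-in for kvm_zap_gfn_range(): drop the SPTE and bump the sequence. */
static void zap_range(void)
{
	pthread_mutex_lock(&mmu_lock);
	mmu_seq++;
	spte = -1;
	pthread_mutex_unlock(&mmu_lock);
}

/* Fault path, step 1: snapshot the sequence before any lockless work. */
static unsigned long fault_begin(void)
{
	return __atomic_load_n(&mmu_seq, __ATOMIC_ACQUIRE);
}

/* Fault path, step 2: install the SPTE unless a zap ran since step 1. */
static bool fault_finish(unsigned long seq, long pfn)
{
	bool installed = false;

	pthread_mutex_lock(&mmu_lock);
	if (seq == mmu_seq) {
		spte = pfn;
		installed = true;
	}
	pthread_mutex_unlock(&mmu_lock);
	return installed;
}

int main(void)
{
	/* Scenario 2b: the zap slips in between snapshot and install. */
	unsigned long seq = fault_begin();

	zap_range();

	printf("racing fault: %s\n",
	       fault_finish(seq, 42) ? "installed SPTE" : "bailed, will retry");

	/* No race: the fault completes and installs the translation. */
	seq = fault_begin();
	printf("clean fault:  %s\n",
	       fault_finish(seq, 42) ? "installed SPTE" : "bailed, will retry");
	return 0;
}

Re-checking under the lock is the load-bearing part of the pattern: as noted
above, the lockless APICv check alone cannot be sufficient because the page
fault handler doesn't hold mmu_lock at the point where it checks APICv status.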