Add a kvm_run flag, KVM_RUN_NEEDS_COMPLETION, to communicate to userspace that KVM_RUN needs to be re-executed prior to save/restore in order to complete the instruction/operation that triggered the userspace exit. KVM's current approach of adding notes in the Documentation is beyond brittle, e.g. there is at least one known case where a KVM developer added a new userspace exit type, and then that same developer forgot to handle completion when adding userspace support. On x86, there are multiple exits that need completion, but they are all conveniently funneled through a single callback, i.e. in theory, this is a one-time thing for KVM x86 (and other architectures could follow suit with additional refactoring). Signed-off-by: Sean Christopherson <seanjc@xxxxxxxxxx> --- Documentation/virt/kvm/api.rst | 48 ++++++++++++++++++++++--------- arch/powerpc/kvm/book3s_emulate.c | 1 + arch/powerpc/kvm/book3s_hv.c | 1 + arch/powerpc/kvm/book3s_pr.c | 2 ++ arch/powerpc/kvm/booke.c | 1 + arch/x86/include/uapi/asm/kvm.h | 7 +++-- arch/x86/kvm/x86.c | 2 ++ include/uapi/linux/kvm.h | 3 ++ virt/kvm/kvm_main.c | 1 + 9 files changed, 49 insertions(+), 17 deletions(-) diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst index c92c8d4e8779..8e172675d8d6 100644 --- a/Documentation/virt/kvm/api.rst +++ b/Documentation/virt/kvm/api.rst @@ -6505,7 +6505,7 @@ local APIC is not used. __u16 flags; -More architecture-specific flags detailing state of the VCPU that may +Common and architecture-specific flags detailing state of the VCPU that may affect the device's behavior. Current defined flags:: /* x86, set if the VCPU is in system management mode */ @@ -6518,6 +6518,8 @@ affect the device's behavior. Current defined flags:: /* arm64, set for KVM_EXIT_DEBUG */ #define KVM_DEBUG_ARCH_HSR_HIGH_VALID (1 << 0) + /* all architectures, set when the exit needs completion (via KVM_RUN) */ + #define KVM_RUN_NEEDS_COMPLETION (1 << 15) :: /* in (pre_kvm_run), out (post_kvm_run) */ @@ -6632,19 +6634,10 @@ requires a guest to interact with host userspace. .. note:: - For KVM_EXIT_IO, KVM_EXIT_MMIO, KVM_EXIT_OSI, KVM_EXIT_PAPR, KVM_EXIT_XEN, - KVM_EXIT_EPR, KVM_EXIT_X86_RDMSR, KVM_EXIT_X86_WRMSR, and KVM_EXIT_HYPERCALL - the corresponding operations are complete (and guest state is consistent) - only after userspace has re-entered the kernel with KVM_RUN. The kernel - side will first finish incomplete operations and then check for pending - signals. + For some exits, userspace must re-enter the kernel with KVM_RUN to + complete the exit and ensure guest state is consistent. - The pending state of the operation is not preserved in state which is - visible to userspace, thus userspace should ensure that the operation is - completed before performing a live migration. Userspace can re-enter the - guest with an unmasked signal pending or with the immediate_exit field set - to complete pending operations without allowing any further instructions - to be executed. + See KVM_CAP_NEEDS_COMPLETION for details. :: @@ -8239,7 +8232,7 @@ Note: Userspace is responsible for correctly configuring CPUID 0x15, a.k.a. the core crystal clock frequency, if a non-zero CPUID 0x15 is exposed to the guest. 7.36 KVM_CAP_X86_GUEST_MODE ------------------------------- +--------------------------- :Architectures: x86 :Returns: Informational only, -EINVAL on direct KVM_ENABLE_CAP. @@ -8252,6 +8245,33 @@ KVM exits with the register state of either the L1 or L2 guest depending on which executed at the time of an exit. Userspace must take care to differentiate between these cases. +7.37 KVM_CAP_NEEDS_COMPLETION +----------------------------- + +:Architectures: all +:Returns: Informational only, -EINVAL on direct KVM_ENABLE_CAP. + +The presence of this capability indicates that KVM_RUN will set +KVM_RUN_NEEDS_COMPLETION in kvm_run.flags if KVM requires userspace to re-enter +the kernel KVM_RUN to complete the exit. + +For select exits, userspace must re-enter the kernel with KVM_RUN to complete +the corresponding operation, only after which is guest state guaranteed to be +consistent. On such a KVM_RUN, the kernel side will first finish incomplete +operations and then check for pending signals. + +The pending state of the operation for such exits is not preserved in state +which is visible to userspace, thus userspace should ensure that the operation +is completed before performing state save/restore, e.g. for live migration. +Userspace can re-enter the guest with an unmasked signal pending or with the +immediate_exit field set to complete pending operations without allowing any +further instructions to be executed. + +Without KVM_CAP_NEEDS_COMPLETION, KVM_RUN_NEEDS_COMPLETION will never be set +and userspace must assume that exits of type KVM_EXIT_IO, KVM_EXIT_MMIO, +KVM_EXIT_OSI, KVM_EXIT_PAPR, KVM_EXIT_XEN, KVM_EXIT_EPR, KVM_EXIT_X86_RDMSR, +KVM_EXIT_X86_WRMSR, and KVM_EXIT_HYPERCALL require completion. + 8. Other capabilities. ====================== diff --git a/arch/powerpc/kvm/book3s_emulate.c b/arch/powerpc/kvm/book3s_emulate.c index de126d153328..15014a66c167 100644 --- a/arch/powerpc/kvm/book3s_emulate.c +++ b/arch/powerpc/kvm/book3s_emulate.c @@ -374,6 +374,7 @@ int kvmppc_core_emulate_op_pr(struct kvm_vcpu *vcpu, } vcpu->run->exit_reason = KVM_EXIT_PAPR_HCALL; + vcpu->run->flags |= KVM_RUN_NEEDS_COMPLETION; vcpu->arch.hcall_needed = 1; emulated = EMULATE_EXIT_USER; break; diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c index b253f7372774..05ad0c3494f1 100644 --- a/arch/powerpc/kvm/book3s_hv.c +++ b/arch/powerpc/kvm/book3s_hv.c @@ -1767,6 +1767,7 @@ static int kvmppc_handle_exit_hv(struct kvm_vcpu *vcpu, for (i = 0; i < 9; ++i) run->papr_hcall.args[i] = kvmppc_get_gpr(vcpu, 4 + i); run->exit_reason = KVM_EXIT_PAPR_HCALL; + run->flags |= KVM_RUN_NEEDS_COMPLETION; vcpu->arch.hcall_needed = 1; r = RESUME_HOST; break; diff --git a/arch/powerpc/kvm/book3s_pr.c b/arch/powerpc/kvm/book3s_pr.c index 83bcdc80ce51..c45beb64905a 100644 --- a/arch/powerpc/kvm/book3s_pr.c +++ b/arch/powerpc/kvm/book3s_pr.c @@ -1310,6 +1310,7 @@ int kvmppc_handle_exit_pr(struct kvm_vcpu *vcpu, unsigned int exit_nr) run->papr_hcall.args[i] = gpr; } run->exit_reason = KVM_EXIT_PAPR_HCALL; + run->flags |= KVM_RUN_NEEDS_COMPLETION; vcpu->arch.hcall_needed = 1; r = RESUME_HOST; } else if (vcpu->arch.osi_enabled && @@ -1320,6 +1321,7 @@ int kvmppc_handle_exit_pr(struct kvm_vcpu *vcpu, unsigned int exit_nr) int i; run->exit_reason = KVM_EXIT_OSI; + run->flags |= KVM_RUN_NEEDS_COMPLETION; for (i = 0; i < 32; i++) gprs[i] = kvmppc_get_gpr(vcpu, i); vcpu->arch.osi_needed = 1; diff --git a/arch/powerpc/kvm/booke.c b/arch/powerpc/kvm/booke.c index 6a5be025a8af..3a0e00178fa5 100644 --- a/arch/powerpc/kvm/booke.c +++ b/arch/powerpc/kvm/booke.c @@ -751,6 +751,7 @@ int kvmppc_core_check_requests(struct kvm_vcpu *vcpu) vcpu->run->epr.epr = 0; vcpu->arch.epr_needed = true; vcpu->run->exit_reason = KVM_EXIT_EPR; + vcpu->run->flags |= KVM_RUN_NEEDS_COMPLETION; r = 0; } diff --git a/arch/x86/include/uapi/asm/kvm.h b/arch/x86/include/uapi/asm/kvm.h index 88585c1de416..e2ec32a8970c 100644 --- a/arch/x86/include/uapi/asm/kvm.h +++ b/arch/x86/include/uapi/asm/kvm.h @@ -104,9 +104,10 @@ struct kvm_ioapic_state { #define KVM_IRQCHIP_IOAPIC 2 #define KVM_NR_IRQCHIPS 3 -#define KVM_RUN_X86_SMM (1 << 0) -#define KVM_RUN_X86_BUS_LOCK (1 << 1) -#define KVM_RUN_X86_GUEST_MODE (1 << 2) +#define KVM_RUN_X86_SMM (1 << 0) +#define KVM_RUN_X86_BUS_LOCK (1 << 1) +#define KVM_RUN_X86_GUEST_MODE (1 << 2) +#define KVM_RUN_X86_NEEDS_COMPLETION (1 << 2) /* for KVM_GET_REGS and KVM_SET_REGS */ struct kvm_regs { diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index a8aa12e0911d..67034b089371 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -10154,6 +10154,8 @@ static void post_kvm_run_save(struct kvm_vcpu *vcpu) kvm_run->flags |= KVM_RUN_X86_SMM; if (is_guest_mode(vcpu)) kvm_run->flags |= KVM_RUN_X86_GUEST_MODE; + if (vcpu->arch.complete_userspace_io) + kvm_run->flags |= KVM_RUN_NEEDS_COMPLETION; } static void update_cr8_intercept(struct kvm_vcpu *vcpu) diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h index 343de0a51797..91dbee3ae0bc 100644 --- a/include/uapi/linux/kvm.h +++ b/include/uapi/linux/kvm.h @@ -192,6 +192,8 @@ struct kvm_xen_exit { /* Flags that describe what fields in emulation_failure hold valid data. */ #define KVM_INTERNAL_ERROR_EMULATION_FLAG_INSTRUCTION_BYTES (1ULL << 0) +#define KVM_RUN_NEEDS_COMPLETION (1 << 15) + /* * struct kvm_run can be modified by userspace at any time, so KVM must be * careful to avoid TOCTOU bugs. In order to protect KVM, HINT_UNSAFE_IN_KVM() @@ -933,6 +935,7 @@ struct kvm_enable_cap { #define KVM_CAP_PRE_FAULT_MEMORY 236 #define KVM_CAP_X86_APIC_BUS_CYCLES_NS 237 #define KVM_CAP_X86_GUEST_MODE 238 +#define KVM_CAP_NEEDS_COMPLETION 239 struct kvm_irq_routing_irqchip { __u32 irqchip; diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c index 7d2076439081..28aa89e5ba85 100644 --- a/virt/kvm/kvm_main.c +++ b/virt/kvm/kvm_main.c @@ -4746,6 +4746,7 @@ static int kvm_vm_ioctl_check_extension_generic(struct kvm *kvm, long arg) case KVM_CAP_CHECK_EXTENSION_VM: case KVM_CAP_ENABLE_CAP_VM: case KVM_CAP_HALT_POLL: + case KVM_CAP_NEEDS_COMPLETION: return 1; #ifdef CONFIG_KVM_MMIO case KVM_CAP_COALESCED_MMIO: -- 2.47.1.613.gc27f4b7a9f-goog