On 1/13/2025 10:09 AM, Chao Gao wrote:
On Fri, Jan 10, 2025 at 05:24:48PM -0800, Sean Christopherson wrote:
Add a kvm_run flag, KVM_RUN_NEEDS_COMPLETION, to communicate to userspace
that KVM_RUN needs to be re-executed prior to save/restore in order to
complete the instruction/operation that triggered the userspace exit.
KVM's current approach of adding notes in the Documentation is beyond
brittle, e.g. there is at least one known case where a KVM developer added
a new userspace exit type, and then that same developer forgot to handle
completion when adding userspace support.
This answers one question I had:
https://lore.kernel.org/kvm/Z1bmUCEdoZ87wIMn@xxxxxxxxx/
In current QEMU code, it always returns back to KVM via KVM_RUN after it
successfully handled a KVM exit reason, no matter what the exit reason is.
The complete_userspace_io() callback will be called if it has been setup.
So if a new kvm exit reason is added in QEMU, it seems QEMU doesn't need
special handing to make the complete_userspace_io() callback be called.
However, QEMU is not the only userspace VMM that supports KVM, it makes
sense to make the solution generic and clear for different userspace VMMs.
Regarding the support of MapGPA for TDX when live migration is considered,
since a big range will be split into 2MB chunks, in order the status is
right after TD live migration, it needs to set the return code to retry
with the next_gpa in the complete_userspace_io() callback if vcpu->wants_to_run
is false or vcpu->run->immediate_exit__unsafe is set, otherwise, TDX guest
will see return code as successful and think the whole range has been converted
successfully.
@@ -1093,7 +1093,8 @@ static int tdx_complete_vmcall_map_gpa(struct kvm_vcpu *vcpu)
* immediately after STI or MOV/POP SS.
*/
if (pi_has_pending_interrupt(vcpu) ||
- kvm_test_request(KVM_REQ_NMI, vcpu) || vcpu->arch.nmi_pending) {
+ kvm_test_request(KVM_REQ_NMI, vcpu) || vcpu->arch.nmi_pending ||
+ !vcpu->wants_to_run) {
tdvmcall_set_return_code(vcpu, TDVMCALL_STATUS_RETRY);
tdx->vp_enter_args.r11 = tdx->map_gpa_next;
return 1;
Of course, it can be addressed later when TD live migration is supported.
So, it is the VMM's (i.e., QEMU's) responsibility to re-execute KVM_RUN in this
case.
Btw, can this flag be used to address the issue [*] with steal time accounting?
We can set the new flag for each vCPU in the PM notifier and we need to change
the re-execution to handle steal time accounting (not just IO completion).
[*]: https://lore.kernel.org/kvm/Z36XJl1OAahVkxhl@xxxxxxxxxx/
one nit below,
--- a/arch/x86/include/uapi/asm/kvm.h
+++ b/arch/x86/include/uapi/asm/kvm.h
@@ -104,9 +104,10 @@ struct kvm_ioapic_state {
#define KVM_IRQCHIP_IOAPIC 2
#define KVM_NR_IRQCHIPS 3
-#define KVM_RUN_X86_SMM (1 << 0)
-#define KVM_RUN_X86_BUS_LOCK (1 << 1)
-#define KVM_RUN_X86_GUEST_MODE (1 << 2)
+#define KVM_RUN_X86_SMM (1 << 0)
+#define KVM_RUN_X86_BUS_LOCK (1 << 1)
+#define KVM_RUN_X86_GUEST_MODE (1 << 2)
+#define KVM_RUN_X86_NEEDS_COMPLETION (1 << 2)
This X86_NEEDS_COMPLETION should be dropped. It is never used.