On Mon, Nov 04, 2024, Kai Huang wrote:
> On Mon, 2024-11-04 at 16:49 +0800, Binbin Wu wrote:
> > On 11/2/2024 12:39 AM, Sean Christopherson wrote:
> > > On Fri, Nov 01, 2024, Kai Huang wrote:
> > > > On Thu, 2024-10-31 at 07:54 -0700, Sean Christopherson wrote:

...

> > > Lightly tested.  Assuming this works for TDX and passes testing, I'll post a
> > > mini-series next week.
> > >
> > > --
> > > From: Sean Christopherson <seanjc@xxxxxxxxxx>
> > > Date: Fri, 1 Nov 2024 09:04:00 -0700
> > > Subject: [PATCH] KVM: x86: Refactor __kvm_emulate_hypercall() to accept reg
> > >  names, not values
> > >
> > > Rework __kvm_emulate_hypercall() to take the names of input and output
> > > (guest return value) registers, as opposed to taking the input values and
> > > returning the output value.  As part of the refactor, change the actual
> > > return value from __kvm_emulate_hypercall() to be KVM's de facto standard
> > > of '0' == exit to userspace, '1' == resume guest, and -errno == failure.
> > >
> > > Using the return value for KVM's control flow eliminates the multiplexed
> > > return value, where '0' for KVM_HC_MAP_GPA_RANGE (and only that hypercall)
> > > means "exit to userspace".
> > >
> > > Use the direct GPR accessors to read values to avoid the pointless marking
> > > of the registers as available, but use kvm_register_write_raw() for the
> > > guest return value so that the innermost helper doesn't need to multiplex
> > > its return value.  Using the generic kvm_register_write_raw() adds very
> > > minimal overhead, so as a one-off in a relatively slow path it's well
> > > worth the code simplification.
> > >
> > > Suggested-by: Kai Huang <kai.huang@xxxxxxxxx>
> > > Signed-off-by: Sean Christopherson <seanjc@xxxxxxxxxx>
> > > ---

...

> > > -	ret = __kvm_emulate_hypercall(vcpu, nr, a0, a1, a2, a3, op_64_bit, cpl);
> > > -	if (nr == KVM_HC_MAP_GPA_RANGE && !ret)
> > > -		/* MAP_GPA tosses the request to the user space. */
> > > +	r = __kvm_emulate_hypercall(vcpu, rax, rbx, rcx, rdx, rsi,
> > > +				    is_64_bit_hypercall(vcpu),
> > > +				    kvm_x86_call(get_cpl)(vcpu), RAX);
> > Now, the register for the return code of the hypercall can be specified.
> > But in ____kvm_emulate_hypercall(), the complete_userspace_io callback
> > is hardcoded to complete_hypercall_exit(), which always writes the return
> > code to RAX.
> >
> > We can allow the caller to pass in the cui callback, or assign a different
> > version according to the input 'ret_reg', so that different callers can
> > use different cui callbacks.  E.g., TDX needs to set the return code in
> > R10 in its cui callback.
> >
> > How about:
> >
> > diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> > index dba78f22ab27..0fba98685f42 100644
> > --- a/arch/x86/include/asm/kvm_host.h
> > +++ b/arch/x86/include/asm/kvm_host.h
> > @@ -2226,13 +2226,15 @@ static inline void kvm_clear_apicv_inhibit(struct kvm *kvm,
> >  int ____kvm_emulate_hypercall(struct kvm_vcpu *vcpu, unsigned long nr,
> >  			      unsigned long a0, unsigned long a1,
> >  			      unsigned long a2, unsigned long a3,
> > -			      int op_64_bit, int cpl, int ret_reg);
> > +			      int op_64_bit, int cpl, int ret_reg,
> > +			      int (*cui)(struct kvm_vcpu *vcpu));
> >
>
> Does below (incremental diff based on Sean's) work?
>
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 734dac079453..5131af97968d 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -10075,7 +10075,6 @@ int ____kvm_emulate_hypercall(struct kvm_vcpu *vcpu, unsigned long nr,
>  			vcpu->run->hypercall.flags |= KVM_EXIT_HYPERCALL_LONG_MODE;
>
>  		WARN_ON_ONCE(vcpu->run->hypercall.flags & KVM_EXIT_HYPERCALL_MBZ);
> -		vcpu->arch.complete_userspace_io = complete_hypercall_exit;
>  		/* stat is incremented on completion. */
>  		return 0;
>  	}
> @@ -10108,8 +10107,11 @@ int kvm_emulate_hypercall(struct kvm_vcpu *vcpu)
>  	r = __kvm_emulate_hypercall(vcpu, rax, rbx, rcx, rdx, rsi,
>  				    is_64_bit_hypercall(vcpu),
>  				    kvm_x86_call(get_cpl)(vcpu), RAX);
> -	if (r <= 0)
> +	if (r <= 0) {
> +		if (!r)
> +			vcpu->arch.complete_userspace_io = complete_hypercall_exit;
>  		return 0;
> +	}

I think I prefer Binbin's version, as it forces the caller to provide cui(),
i.e. makes it harder for KVM to fail to handle the backend of the hypercall.

Side topic, unless I'm missing something, all of the instruction VM-Exits that
are piped through #VMGEXIT to svm_invoke_exit_handler() will make a bogus call
to kvm_skip_emulated_instruction().  It "works", but only because nothing ever
looks at the modified data.  At first, I thought it was just VMMCALL, and so
was wondering if maybe we could fix that wart too.  But it's *every*
instruction, so it's probably out of scope.

The one thing I don't love about providing a separate cui() is that it means
duplicating the guts of the completion helper.  Ha!  But we can avoid that by
adding another macro (untested).  More macros/helpers is a bit ugly too, but I
like the symmetry, and it will definitely be easier to maintain.  E.g. if the
completion phase needs to pivot on the exact hypercall, then we can update
common code and don't need to remember to go update TDX too.

If no one objects and/or has a better idea, I'll splice together Binbin's
patch with this blob, and post a series tomorrow.

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 8e8ca6dab2b2..0b0fa9174000 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -2179,6 +2179,16 @@ static inline void kvm_clear_apicv_inhibit(struct kvm *kvm,
 	kvm_set_or_clear_apicv_inhibit(kvm, reason, false);
 }
 
+#define kvm_complete_hypercall_exit(vcpu, ret_reg)		\
+do {								\
+	u64 ret = (vcpu)->run->hypercall.ret;			\
+								\
+	if (!is_64_bit_mode(vcpu))				\
+		ret = (u32)ret;					\
+	kvm_##ret_reg##_write(vcpu, ret);			\
+	++(vcpu)->stat.hypercalls;				\
+} while (0)
+
 int ____kvm_emulate_hypercall(struct kvm_vcpu *vcpu, unsigned long nr,
 			      unsigned long a0, unsigned long a1,
 			      unsigned long a2, unsigned long a3,
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 425a301911a6..aec79e132d3b 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -9989,12 +9989,8 @@ static void kvm_sched_yield(struct kvm_vcpu *vcpu, unsigned long dest_id)
 
 static int complete_hypercall_exit(struct kvm_vcpu *vcpu)
 {
-	u64 ret = vcpu->run->hypercall.ret;
+	kvm_complete_hypercall_exit(vcpu, rax);
 
-	if (!is_64_bit_mode(vcpu))
-		ret = (u32)ret;
-	kvm_rax_write(vcpu, ret);
-	++vcpu->stat.hypercalls;
 	return kvm_skip_emulated_instruction(vcpu);
 }
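
For illustration only (nothing in this thread defines it, and the name is made
up), a hypothetical TDX-side completion callback could then be a near-mirror
of complete_hypercall_exit(), simply naming r10 instead of rax:

/*
 * Hypothetical sketch, not from this thread: a TDX completion callback that
 * reuses the proposed macro.  Token pasting turns the "r10" argument into
 * kvm_r10_write(), so the hypercall return code lands in R10 instead of RAX
 * without duplicating the truncation or stats logic.
 */
static int complete_tdvmcall_exit(struct kvm_vcpu *vcpu)
{
	kvm_complete_hypercall_exit(vcpu, r10);

	/*
	 * Assumption for illustration: no kvm_skip_emulated_instruction(),
	 * as for TDX the guest's RIP is advanced by the TDX module rather
	 * than by KVM.
	 */
	return 1;
}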