> On Thu, Jul 18, 2024, H. Peter Anvin wrote:
> > On July 12, 2024 8:12:51 AM PDT, Sean Christopherson <seanjc@xxxxxxxxxx> wrote:
> > >On Wed, Jul 10, 2024, Xin3 Li wrote:
> > >> > On Wed, Feb 07, 2024, Xin Li wrote:
> > >> > > Switch MSR_IA32_FRED_RSP0 between host and guest in
> > >> > > vmx_prepare_switch_to_{host,guest}().
> > >> > >
> > >> > > MSR_IA32_FRED_RSP0 is used during ring 3 event delivery only, thus
> > >> > > KVM, running on ring 0, can run safely with guest FRED RSP0, i.e.,
> > >> > > no need to switch between host/guest FRED RSP0 during VM entry and
> > >> > > exit.
> > >> > >
> > >> > > KVM should switch to host FRED RSP0 before returning to user level,
> > >> > > and switch to guest FRED RSP0 before entering guest mode.
> > >> >
> > >> > Heh, if only KVM had a framework that was specifically designed for
> > >> > context switching MSRs on return to userspace.  Translation: please
> > >> > use the user_return_msr() APIs.
> > >>
> > >> IIUC the user return MSR framework works for MSRs that are per CPU
> > >> constants, but like MSR_KERNEL_GS_BASE, MSR_IA32_FRED_RSP0 is a per
> > >> *task* constant, thus we can't use it.
> > >
> > >Ah, in that case, the changelog is very misleading and needs to be fixed.
> > >Alternatively, is the desired RSP0 value tracked anywhere other than the
> > >MSR?  E.g. if it's somewhere in task_struct, then kvm_on_user_return()
> > >would restore the current task's desired RSP0.  Even if we don't get
> > >fancy, avoiding the RDMSR to get the current task's value would be nice.
> >
> > Hm, perhaps the right thing to do is to always invoke this function
> > before a context switch happens if that happens before return to user
> > space?
>
> Actually, if the _TIF_NEED_RSP0_LOAD doesn't provide a meaningful benefit
> (or y'all just don't want it :-) ),

We want it 😊.

My concern was adding an extra check of (ti_work & _TIF_NEED_RSP0_LOAD) into
a hot function, arch_exit_to_user_mode_prepare().  HPA looked at the function
and suggested testing ti_work for zero first and only then processing the
individual bits in it:

diff --git a/arch/x86/include/asm/entry-common.h b/arch/x86/include/asm/entry-common.h
index fb2809b20b0a..4c78b99060b5 100644
--- a/arch/x86/include/asm/entry-common.h
+++ b/arch/x86/include/asm/entry-common.h
@@ -47,15 +47,17 @@ static __always_inline void arch_enter_from_user_mode(struct pt_regs *regs)
 static inline void arch_exit_to_user_mode_prepare(struct pt_regs *regs,
 						  unsigned long ti_work)
 {
-	if (ti_work & _TIF_USER_RETURN_NOTIFY)
-		fire_user_return_notifiers();
+	if (ti_work) {
+		if (ti_work & _TIF_USER_RETURN_NOTIFY)
+			fire_user_return_notifiers();
 
-	if (unlikely(ti_work & _TIF_IO_BITMAP))
-		tss_update_io_bitmap();
+		if (unlikely(ti_work & _TIF_IO_BITMAP))
+			tss_update_io_bitmap();
 
-	fpregs_assert_state_consistent();
-	if (unlikely(ti_work & _TIF_NEED_FPU_LOAD))
-		switch_fpu_return();
+		fpregs_assert_state_consistent();
+		if (unlikely(ti_work & _TIF_NEED_FPU_LOAD))
+			switch_fpu_return();
+	}
 
 #ifdef CONFIG_COMPAT
 	/*

Based on it, I measured how many ti_work values are zero out of every one
million samples in kernel build tests: over 99%, i.e., unlikely(ti_work)
holds.  When booting a KVM guest it drops to 75%, which is expected; once the
guest is up and running a kernel build inside it, it is back to 99%.

So at least this patch looks like low-hanging fruit, and I have sent it to
Intel 0day for broader perf tests.
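Just for illustration, and assuming the above patch lands first, here is a
rough sketch of where the extra RSP0 check could sit under the ti_work guard.
_TIF_NEED_RSP0_LOAD does not exist yet, and the reload expression is only one
example of how the current task's value could be restored:

	if (ti_work) {
		if (ti_work & _TIF_USER_RETURN_NOTIFY)
			fire_user_return_notifiers();

		/*
		 * Sketch only: reload the current task's FRED RSP0 (top of
		 * its kernel stack) when a KVM vCPU run left the guest value
		 * in the MSR.  The flag is a placeholder, not existing code.
		 */
		if (unlikely(ti_work & _TIF_NEED_RSP0_LOAD))
			wrmsrl(MSR_IA32_FRED_RSP0,
			       (unsigned long)task_stack_page(current) + THREAD_SIZE);

		/* ... remaining ti_work processing as in the patch above ... */
	}

That way the new test costs nothing in the common case where ti_work is zero.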
As context switches are way less frequent than exits to user mode, I do NOT
expect it to make a difference whether we write MSR_IA32_FRED_RSP0 on exit to
user mode or on context switch, especially when we do it on top of the above
patch.
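Purely as a sketch of the KVM side (not the actual patch; the
msr_guest_fred_rsp0 field and the TIF_NEED_RSP0_LOAD flag are placeholders),
the prepare-switch helpers could then hand the host-side restore off to that
deferred path:

static void vmx_switch_fred_rsp0_to_guest(struct vcpu_vmx *vmx)
{
	/*
	 * Load only the guest value here; KVM keeps running in ring 0 with
	 * it loaded, because FRED RSP0 is consumed only during ring-3 event
	 * delivery.
	 */
	wrmsrl(MSR_IA32_FRED_RSP0, vmx->msr_guest_fred_rsp0);
}

static void vmx_switch_fred_rsp0_to_host(struct vcpu_vmx *vmx)
{
	/*
	 * Save the possibly guest-written value, then defer restoring the
	 * task's RSP0 to either the next context switch or the
	 * exit-to-user-mode path via the placeholder flag.
	 */
	rdmsrl(MSR_IA32_FRED_RSP0, vmx->msr_guest_fred_rsp0);
	set_thread_flag(TIF_NEED_RSP0_LOAD);
}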