> On Thu, Jul 18, 2024, H. Peter Anvin wrote:
> > On July 12, 2024 8:12:51 AM PDT, Sean Christopherson <seanjc@xxxxxxxxxx> wrote:
> > >On Wed, Jul 10, 2024, Xin3 Li wrote:
> > >> > On Wed, Feb 07, 2024, Xin Li wrote:
> > >> > > Switch MSR_IA32_FRED_RSP0 between host and guest in
> > >> > > vmx_prepare_switch_to_{host,guest}().
> > >> > >
> > >> > > MSR_IA32_FRED_RSP0 is used during ring 3 event delivery only, thus
> > >> > > KVM, running on ring 0, can run safely with guest FRED RSP0, i.e.,
> > >> > > no need to switch between host/guest FRED RSP0 during VM entry and
> > >> > > exit.
> > >> > >
> > >> > > KVM should switch to host FRED RSP0 before returning to user level,
> > >> > > and switch to guest FRED RSP0 before entering guest mode.
> > >> >
> > >> > Heh, if only KVM had a framework that was specifically designed for
> > >> > context switching MSRs on return to userspace.  Translation: please
> > >> > use the user_return_msr() APIs.
> > >>
> > >> IIUC the user return MSR framework works for MSRs that are per CPU
> > >> constants, but like MSR_KERNEL_GS_BASE, MSR_IA32_FRED_RSP0 is a per
> > >> *task* constant, thus we can't use it.
> > >
> > >Ah, in that case, the changelog is very misleading and needs to be fixed.
> > >Alternatively, is the desired RSP0 value tracked anywhere other than the
> > >MSR?  E.g. if it's somewhere in task_struct, then kvm_on_user_return()
> > >would restore the current task's desired RSP0.  Even if we don't get
> > >fancy, avoiding the RDMSR to get the current task's value would be nice.
> >
> > Hm, perhaps the right thing to do is to always invoke this function
> > before a context switch happens if that happens before return to user
> > space?
>
> Actually, if the _TIF_NEED_RSP0_LOAD doesn't provide a meaningful benefit
> (or y'all just don't want it :-) ),

We want it 😊.

My concern was adding an extra check of (ti_work & _TIF_NEED_RSP0_LOAD) into
a hot function, arch_exit_to_user_mode_prepare().  HPA looked at the function
and suggested testing ti_work for zero first and only then processing the
individual bits in it:

diff --git a/arch/x86/include/asm/entry-common.h b/arch/x86/include/asm/entry-common.h
index fb2809b20b0a..4c78b99060b5 100644
--- a/arch/x86/include/asm/entry-common.h
+++ b/arch/x86/include/asm/entry-common.h
@@ -47,15 +47,17 @@ static __always_inline void arch_enter_from_user_mode(struct pt_regs *regs)
 static inline void arch_exit_to_user_mode_prepare(struct pt_regs *regs,
 						  unsigned long ti_work)
 {
-	if (ti_work & _TIF_USER_RETURN_NOTIFY)
-		fire_user_return_notifiers();
+	if (ti_work) {
+		if (ti_work & _TIF_USER_RETURN_NOTIFY)
+			fire_user_return_notifiers();
 
-	if (unlikely(ti_work & _TIF_IO_BITMAP))
-		tss_update_io_bitmap();
+		if (unlikely(ti_work & _TIF_IO_BITMAP))
+			tss_update_io_bitmap();
 
-	fpregs_assert_state_consistent();
-	if (unlikely(ti_work & _TIF_NEED_FPU_LOAD))
-		switch_fpu_return();
+		fpregs_assert_state_consistent();
+		if (unlikely(ti_work & _TIF_NEED_FPU_LOAD))
+			switch_fpu_return();
+	}
 
 #ifdef CONFIG_COMPAT
 	/*

Based on it, I measured how many ti_work values are zero out of every one
million samples in kernel build tests: over 99%, i.e., unlikely(ti_work)
holds.  When booting a KVM guest it drops to 75%, which is expected; once the
guest is up and running a kernel build inside it, it is back to 99%.

So at least this patch looks like low-hanging fruit, and I have sent it to
Intel 0day for broader perf tests.
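Just for illustration, and assuming the above patch lands first, here is a
rough sketch of where the extra RSP0 check could sit under the ti_work guard.
_TIF_NEED_RSP0_LOAD does not exist yet, and the reload expression is only one
example of how the current task's value could be restored:

	if (ti_work) {
		if (ti_work & _TIF_USER_RETURN_NOTIFY)
			fire_user_return_notifiers();

		/*
		 * Sketch only: reload the current task's FRED RSP0 (top of
		 * its kernel stack) when a KVM vCPU run left the guest value
		 * in the MSR.  The flag is a placeholder, not existing code.
		 */
		if (unlikely(ti_work & _TIF_NEED_RSP0_LOAD))
			wrmsrl(MSR_IA32_FRED_RSP0,
			       (unsigned long)task_stack_page(current) + THREAD_SIZE);

		/* ... remaining ti_work processing as in the patch above ... */
	}

That way the new test costs nothing in the common case where ti_work is zero.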
As context switches are way less frequent than exits to user mode, I do NOT
expect it to make a difference whether we write MSR_IA32_FRED_RSP0 on exit to
user mode or on context switch, especially when we do it on top of the above
patch.
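Purely as a sketch of the KVM side (not the actual patch; the
msr_guest_fred_rsp0 field and the TIF_NEED_RSP0_LOAD flag are placeholders),
the prepare-switch helpers could then hand the host-side restore off to that
deferred path:

static void vmx_switch_fred_rsp0_to_guest(struct vcpu_vmx *vmx)
{
	/*
	 * Load only the guest value here; KVM keeps running in ring 0 with
	 * it loaded, because FRED RSP0 is consumed only during ring-3 event
	 * delivery.
	 */
	wrmsrl(MSR_IA32_FRED_RSP0, vmx->msr_guest_fred_rsp0);
}

static void vmx_switch_fred_rsp0_to_host(struct vcpu_vmx *vmx)
{
	/*
	 * Save the possibly guest-written value, then defer restoring the
	 * task's RSP0 to either the next context switch or the
	 * exit-to-user-mode path via the placeholder flag.
	 */
	rdmsrl(MSR_IA32_FRED_RSP0, vmx->msr_guest_fred_rsp0);
	set_thread_flag(TIF_NEED_RSP0_LOAD);
}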