On Tue, Feb 15, 2022 at 10:57 AM Guenter Roeck <groeck@xxxxxxxxxx> wrote: > > Hi Brian, > > On Tue, Feb 15, 2022 at 7:37 AM Brian Geffon <bgeffon@xxxxxxxxxx> wrote: > > > > There are two issues with PKRU handling prior to 5.13. The first is that > > Nice catch and work. One question, though: From the above, it seems > like this patch only applies to kernels earlier than v5.13 or, more > specifically, to v5.4.y and v5.10.y. Is this correct, or should it be > applied to the upstream kernel and to all applicable stable releases ? This only applies before 5.13, so 5.10.y and 5.4.y, the behavior decoupled PKRU from xstate in a long series from Thomas Gleixner, but the first commit where this would have been fixed in 5.13 would be: commit 954436989cc550dd91aab98363240c9c0a4b7e23 Author: Thomas Gleixner <tglx@xxxxxxxxxxxxx> Date: Wed Jun 23 14:02:21 2021 +0200 x86/fpu: Remove PKRU handling from switch_fpu_finish() > > Thanks, > Guenter > > > when eagerly switching PKRU we check that current is not a kernel > > thread as kernel threads will never use PKRU. It's possible that > > this_cpu_read_stable() on current_task (ie. get_current()) is returning > > an old cached value. By forcing the read with this_cpu_read() the > > correct task is used. Without this it's possible when switching from > > a kernel thread to a userspace thread that we'll still observe the > > PF_KTHREAD flag and never restore the PKRU. And as a result this > > issue only occurs when switching from a kernel thread to a userspace > > thread, switching from a non kernel thread works perfectly fine because > > all we consider in that situation is the flags from some other non > > kernel task and the next fpu is passed in to switch_fpu_finish(). > > > > Without reloading the value finish_fpu_load() after being inlined into > > __switch_to() uses a stale value of current: > > > > ba1: 8b 35 00 00 00 00 mov 0x0(%rip),%esi > > ba7: f0 41 80 4d 01 40 lock orb $0x40,0x1(%r13) > > bad: e9 00 00 00 00 jmp bb2 <__switch_to+0x1eb> > > bb2: 41 f6 45 3e 20 testb $0x20,0x3e(%r13) > > bb7: 75 1c jne bd5 <__switch_to+0x20e> > > > > By using this_cpu_read() and avoiding the cached value the compiler does > > insert an additional load instruction and observes the correct value now: > > > > ba1: 8b 35 00 00 00 00 mov 0x0(%rip),%esi > > ba7: f0 41 80 4d 01 40 lock orb $0x40,0x1(%r13) > > bad: e9 00 00 00 00 jmp bb2 <__switch_to+0x1eb> > > bb2: 65 48 8b 05 00 00 00 mov %gs:0x0(%rip),%rax > > bb9: 00 > > bba: f6 40 3e 20 testb $0x20,0x3e(%rax) > > bbe: 75 1c jne bdc <__switch_to+0x215> > > > > The second issue is when using write_pkru() we only write to the > > xstate when the feature bit is set because get_xsave_addr() returns > > NULL when the feature bit is not set. This is problematic as the CPU > > is free to clear the feature bit when it observes the xstate in the > > init state, this behavior seems to be documented a few places throughout > > the kernel. If the bit was cleared then in write_pkru() we would happily > > write to PKRU without ever updating the xstate, and the FPU restore on > > return to userspace would load the old value agian. > > > > Fixes: 0cecca9d03c9 ("x86/fpu: Eager switch PKRU state") > > Signed-off-by: Brian Geffon <bgeffon@xxxxxxxxxx> > > Signed-off-by: Willis Kung <williskung@xxxxxxxxxx> > > Tested-by: Willis Kung <williskung@xxxxxxxxxx> > > --- > > arch/x86/include/asm/fpu/internal.h | 2 +- > > arch/x86/include/asm/pgtable.h | 14 ++++++++++---- > > 2 files changed, 11 insertions(+), 5 deletions(-) > > > > diff --git a/arch/x86/include/asm/fpu/internal.h b/arch/x86/include/asm/fpu/internal.h > > index 03b3de491b5e..540bda5bdd28 100644 > > --- a/arch/x86/include/asm/fpu/internal.h > > +++ b/arch/x86/include/asm/fpu/internal.h > > @@ -598,7 +598,7 @@ static inline void switch_fpu_finish(struct fpu *new_fpu) > > * PKRU state is switched eagerly because it needs to be valid before we > > * return to userland e.g. for a copy_to_user() operation. > > */ > > - if (!(current->flags & PF_KTHREAD)) { > > + if (!(this_cpu_read(current_task)->flags & PF_KTHREAD)) { > > /* > > * If the PKRU bit in xsave.header.xfeatures is not set, > > * then the PKRU component was in init state, which means > > diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h > > index 9e71bf86d8d0..aa381b530de0 100644 > > --- a/arch/x86/include/asm/pgtable.h > > +++ b/arch/x86/include/asm/pgtable.h > > @@ -140,16 +140,22 @@ static inline void write_pkru(u32 pkru) > > if (!boot_cpu_has(X86_FEATURE_OSPKE)) > > return; > > > > - pk = get_xsave_addr(¤t->thread.fpu.state.xsave, XFEATURE_PKRU); > > - > > /* > > * The PKRU value in xstate needs to be in sync with the value that is > > * written to the CPU. The FPU restore on return to userland would > > * otherwise load the previous value again. > > */ > > fpregs_lock(); > > - if (pk) > > - pk->pkru = pkru; > > + /* > > + * The CPU is free to clear the feature bit when the xstate is in the > > + * init state. For this reason, we need to make sure the feature bit is > > + * reset when we're explicitly writing to pkru. If we did not then we > > + * would write to pkru and it would not be saved on a context switch. > > + */ > > + current->thread.fpu.state.xsave.header.xfeatures |= XFEATURE_MASK_PKRU; > > + pk = get_xsave_addr(¤t->thread.fpu.state.xsave, XFEATURE_PKRU); > > + BUG_ON(!pk); > > + pk->pkru = pkru; > > __write_pkru(pkru); > > fpregs_unlock(); > > } > > -- > > 2.35.1.265.g69c8d7142f-goog > >