Hi Dave,
Thank you for taking a look at this.

On Tue, Feb 15, 2022 at 12:13 PM Dave Hansen <dave.hansen@xxxxxxxxx> wrote:
>
> On 2/15/22 07:36, Brian Geffon wrote:
> > There are two issues with PKRU handling prior to 5.13.
>
> Are you sure both of these issues were introduced by 0cecca9d03c? I'm
> surprised that the get_xsave_addr() issue is not older.
>
> Should this be two patches?

You're right, the get_xsave_addr() issue is much older than the eager
reloading of PKRU. I'll split this out into two patches.

>
> > The first is that when eagerly switching PKRU we check that current
>
> Don't forget to write in imperative mood. No "we's", please.
>
> https://www.kernel.org/doc/html/latest/process/maintainer-tip.html
>
> This goes for changelogs and comments too.

This will be corrected in future patches.

>
> > is not a kernel thread as kernel threads will never use PKRU. It's
> > possible that this_cpu_read_stable() on current_task (ie.
> > get_current()) is returning an old cached value. By forcing the read
> > with this_cpu_read() the correct task is used. Without this it's
> > possible when switching from a kernel thread to a userspace thread
> > that we'll still observe the PF_KTHREAD flag and never restore the
> > PKRU. And as a result this issue only occurs when switching from a
> > kernel thread to a userspace thread, switching from a non kernel
> > thread works perfectly fine because all we consider in that situation
> > is the flags from some other non kernel task and the next fpu is
> > passed in to switch_fpu_finish().
>
> It makes *sense* that there would be a place in the context switch code
> where 'current' is wonky, but I never realized this. This seems really
> fragile, but *also* trivially detectable.
>
> Is the PKRU code really the only code to use 'current' in a buggy way
> like this?

Yes, because the remaining code in __switch_to() references the next
task as next_p rather than current.
Technically, it might be more correct to pass next_p to
switch_fpu_finish(). What do you think? This may make sense since we're
also passing the next fpu anyway.

>
> > The second issue is when using write_pkru() we only write to the
> > xstate when the feature bit is set because get_xsave_addr() returns
> > NULL when the feature bit is not set. This is problematic as the CPU
> > is free to clear the feature bit when it observes the xstate in the
> > init state, this behavior seems to be documented a few places throughout
> > the kernel. If the bit was cleared then in write_pkru() we would happily
> > write to PKRU without ever updating the xstate, and the FPU restore on
> > return to userspace would load the old value agian.
> >
>                                                  ^ again
>
> It's probably worth noting that the AMD init tracker is a lot more
> aggressive than Intel's. On Intel, I think XRSTOR is the only way to
> get back to the init state. You're obviously hitting this on AMD.
>
> It's also *very* unlikely that PKRU gets back to a value of 0. I think
> we added a selftest for this case in later kernels.
>
> That helps explain why this bug hung around for so long.
>
> > diff --git a/arch/x86/include/asm/fpu/internal.h b/arch/x86/include/asm/fpu/internal.h
> > index 03b3de491b5e..540bda5bdd28 100644
> > --- a/arch/x86/include/asm/fpu/internal.h
> > +++ b/arch/x86/include/asm/fpu/internal.h
> > @@ -598,7 +598,7 @@ static inline void switch_fpu_finish(struct fpu *new_fpu)
> >          * PKRU state is switched eagerly because it needs to be valid before we
> >          * return to userland e.g. for a copy_to_user() operation.
> >          */
> > -       if (!(current->flags & PF_KTHREAD)) {
> > +       if (!(this_cpu_read(current_task)->flags & PF_KTHREAD)) {
>
> This really deserves a specific comment.
>
> >                 /*
> >                  * If the PKRU bit in xsave.header.xfeatures is not set,
> >                  * then the PKRU component was in init state, which means
> > diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
> > index 9e71bf86d8d0..aa381b530de0 100644
> > --- a/arch/x86/include/asm/pgtable.h
> > +++ b/arch/x86/include/asm/pgtable.h
> > @@ -140,16 +140,22 @@ static inline void write_pkru(u32 pkru)
> >         if (!boot_cpu_has(X86_FEATURE_OSPKE))
> >                 return;
> >
> > -       pk = get_xsave_addr(&current->thread.fpu.state.xsave, XFEATURE_PKRU);
> > -
> >         /*
> >          * The PKRU value in xstate needs to be in sync with the value that is
> >          * written to the CPU. The FPU restore on return to userland would
> >          * otherwise load the previous value again.
> >          */
> >         fpregs_lock();
> > -       if (pk)
> > -               pk->pkru = pkru;
> > +       /*
> > +        * The CPU is free to clear the feature bit when the xstate is in the
> > +        * init state. For this reason, we need to make sure the feature bit is
> > +        * reset when we're explicitly writing to pkru. If we did not then we
> > +        * would write to pkru and it would not be saved on a context switch.
> > +        */
> > +       current->thread.fpu.state.xsave.header.xfeatures |= XFEATURE_MASK_PKRU;
>
> I don't think we need to describe how the init optimization works again.
> I'm also not sure it's worth mentioning context switches here. It's a
> wider problem than that. Maybe:
>
>         /*
>          * All fpregs will be XRSTOR'd from this buffer before returning
>          * to userspace. Ensure that XRSTOR does not init PKRU and that
>          * get_xsave_addr() will work.
>          */
>
> > +       pk = get_xsave_addr(&current->thread.fpu.state.xsave, XFEATURE_PKRU);
> > +       BUG_ON(!pk);
>
> A BUG_ON() a line before a NULL pointer dereference doesn't tend to do
> much good.
>
> > +       pk->pkru = pkru;
> >         __write_pkru(pkru);
> >         fpregs_unlock();
> >  }
>

Brian
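
For anyone following the get_xsave_addr() half of this thread, here is a
rough user-space model of the problem. It is only a sketch, not kernel
code: every model_* name and MODEL_* constant is hypothetical, the struct
is a two-field stand-in for the real xsave buffer, and XRSTOR is assumed
to behave as described above (a component whose feature bit is clear is
put back into its init state, taken here to be 0 for PKRU).

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model constants, named after the kernel's but not taken from it. */
#define MODEL_XFEATURE_MASK_PKRU (1ULL << 9)
#define MODEL_PKRU_INIT          0u

/* Two-field stand-in for the in-memory xsave buffer (assumption). */
struct model_xsave {
	uint64_t xfeatures; /* models xsave.header.xfeatures */
	uint32_t pkru;      /* models the PKRU component */
};

/* Models get_xsave_addr() as described in the thread: NULL if the bit is clear. */
static uint32_t *model_get_pkru_addr(struct model_xsave *xs)
{
	if (!(xs->xfeatures & MODEL_XFEATURE_MASK_PKRU))
		return NULL;
	return &xs->pkru;
}

/* Old behaviour: the buffer is updated only when the feature bit is set. */
static void model_write_pkru_old(struct model_xsave *xs, uint32_t *hw_pkru,
				 uint32_t pkru)
{
	uint32_t *pk = model_get_pkru_addr(xs);

	if (pk)
		*pk = pkru;
	*hw_pkru = pkru; /* models __write_pkru() */
}

/* Proposed behaviour: set the feature bit first, then update the buffer. */
static void model_write_pkru_fixed(struct model_xsave *xs, uint32_t *hw_pkru,
				   uint32_t pkru)
{
	uint32_t *pk;

	xs->xfeatures |= MODEL_XFEATURE_MASK_PKRU;
	pk = model_get_pkru_addr(xs); /* cannot be NULL, the bit was just set */
	*pk = pkru;
	*hw_pkru = pkru;
}

/* Models XRSTOR for PKRU only: a clear feature bit yields the init value. */
static void model_xrstor(const struct model_xsave *xs, uint32_t *hw_pkru)
{
	if (xs->xfeatures & MODEL_XFEATURE_MASK_PKRU)
		*hw_pkru = xs->pkru;
	else
		*hw_pkru = MODEL_PKRU_INIT;
}
```

Starting from a zeroed model_xsave (feature bit clear), calling
model_write_pkru_old() followed by model_xrstor() leaves the modeled
register back at the init value, which is the lost write described above.
With model_write_pkru_fixed() the written value survives the restore.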