Re: [PATCH] x86/fpu: Correct pkru/xstate inconsistency

Brian Geffon <bgeffon@xxxxxxxxxx> · Tue, 15 Feb 2022 12:50:47 -0500

Hi Dave,
Thank you for taking a look at this.

On Tue, Feb 15, 2022 at 12:13 PM Dave Hansen <dave.hansen@xxxxxxxxx> wrote:
>
> On 2/15/22 07:36, Brian Geffon wrote:
> > There are two issues with PKRU handling prior to 5.13.
>
> Are you sure both of these issues were introduced by 0cecca9d03c?  I'm
> surprised that the get_xsave_addr() issue is not older.
>
> Should this be two patches?

You're right, the get_xsave_addr() issue is much older than the eager
reloading of PKRU. I'll split this out into two patches.

>
> > The first is that when eagerly switching PKRU we check that current
>
> Don't forget to write in imperative mood.  No "we's", please.
>
> https://www.kernel.org/doc/html/latest/process/maintainer-tip.html
>
> This goes for changelogs and comments too.

This will be corrected in future patches.

>
> > is not a kernel thread as kernel threads will never use PKRU. It's
> > possible that this_cpu_read_stable() on current_task (ie.
> > get_current()) is returning an old cached value. By forcing the read
> > with this_cpu_read() the correct task is used. Without this it's
> > possible when switching from a kernel thread to a userspace thread
> > that we'll still observe the PF_KTHREAD flag and never restore the
> > PKRU. And as a result this issue only occurs when switching from a
> > kernel thread to a userspace thread, switching from a non kernel
> > thread works perfectly fine because all we consider in that situation
> > is the flags from some other non kernel task and the next fpu is
> > passed in to switch_fpu_finish().
>
> It makes *sense* that there would be a place in the context switch code
> where 'current' is wonky, but I never realized this.  This seems really
> fragile, but *also* trivially detectable.
>
> Is the PKRU code really the only code to use 'current' in a buggy way
> like this?

Yes, because the remaining code in __switch_to() references the next
task as next_p rather than current. Technically, it might be more
correct to pass next_p to switch_fpu_finish(). What do you think? This
may make sense since the next fpu is already being passed in anyway.
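
To make the next_p idea concrete, here is a rough userspace model, not
kernel code. Every name in it (model_task, MODEL_PF_KTHREAD,
model_stale_current, and both finish_* helpers) is hypothetical, and it
assumes the stale-read theory from the changelog is correct. It only
contrasts deciding from a possibly stale 'current' with deciding from
the task being switched to.

```c
/* Userspace model only: all names are hypothetical, not kernel API. */
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MODEL_PF_KTHREAD 0x1u

struct model_task {
	unsigned int flags;
	unsigned int pkru;
};

/* Stands in for a value of 'current' that was cached before the switch. */
static struct model_task *model_stale_current;

/* Stands in for the CPU's PKRU register. */
static unsigned int model_hw_pkru;

/* Variant A: decide from the cached 'current', which may be the old task. */
static bool finish_using_cached_current(const struct model_task *next)
{
	assert(next != NULL);
	if (model_stale_current->flags & MODEL_PF_KTHREAD)
		return false;	/* PKRU restore skipped */
	model_hw_pkru = next->pkru;
	return true;
}

/* Variant B: decide from the task that is being switched to. */
static bool finish_using_next(const struct model_task *next)
{
	assert(next != NULL);
	if (next->flags & MODEL_PF_KTHREAD)
		return false;	/* kernel threads never use PKRU */
	model_hw_pkru = next->pkru;
	return true;
}
```

With model_stale_current pointing at a kernel thread and next being a
user task, variant A skips the restore and variant B performs it. That
is the kthread-to-user case described in the changelog.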

>
> > The second issue is when using write_pkru() we only write to the
> > xstate when the feature bit is set because get_xsave_addr() returns
> > NULL when the feature bit is not set. This is problematic as the CPU
> > is free to clear the feature bit when it observes the xstate in the
> > init state, this behavior seems to be documented a few places throughout
> > the kernel. If the bit was cleared then in write_pkru() we would happily
> > write to PKRU without ever updating the xstate, and the FPU restore on
> > return to userspace would load the old value agian.
>
>
>                                                 ^ again
>
> It's probably worth noting that the AMD init tracker is a lot more
> aggressive than Intel's.  On Intel, I think XRSTOR is the only way to
> get back to the init state.  You're obviously hitting this on AMD.
>
> It's also *very* unlikely that PKRU gets back to a value of 0.  I think
> we added a selftest for this case in later kernels.
>
> That helps explain why this bug hung around for so long.
>
> > diff --git a/arch/x86/include/asm/fpu/internal.h b/arch/x86/include/asm/fpu/internal.h
> > index 03b3de491b5e..540bda5bdd28 100644
> > --- a/arch/x86/include/asm/fpu/internal.h
> > +++ b/arch/x86/include/asm/fpu/internal.h
> > @@ -598,7 +598,7 @@ static inline void switch_fpu_finish(struct fpu *new_fpu)
> >        * PKRU state is switched eagerly because it needs to be valid before we
> >        * return to userland e.g. for a copy_to_user() operation.
> >        */
> > -     if (!(current->flags & PF_KTHREAD)) {
> > +     if (!(this_cpu_read(current_task)->flags & PF_KTHREAD)) {
>
> This really deserves a specific comment.
>
> >               /*
> >                * If the PKRU bit in xsave.header.xfeatures is not set,
> >                * then the PKRU component was in init state, which means
> > diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
> > index 9e71bf86d8d0..aa381b530de0 100644
> > --- a/arch/x86/include/asm/pgtable.h
> > +++ b/arch/x86/include/asm/pgtable.h
> > @@ -140,16 +140,22 @@ static inline void write_pkru(u32 pkru)
> >       if (!boot_cpu_has(X86_FEATURE_OSPKE))
> >               return;
> >
> > -     pk = get_xsave_addr(&current->thread.fpu.state.xsave, XFEATURE_PKRU);
> > -
> >       /*
> >        * The PKRU value in xstate needs to be in sync with the value that is
> >        * written to the CPU. The FPU restore on return to userland would
> >        * otherwise load the previous value again.
> >        */
> >       fpregs_lock();
> > -     if (pk)
> > -             pk->pkru = pkru;
> > +     /*
> > +      * The CPU is free to clear the feature bit when the xstate is in the
> > +      * init state. For this reason, we need to make sure the feature bit is
> > +      * reset when we're explicitly writing to pkru. If we did not then we
> > +      * would write to pkru and it would not be saved on a context switch.
> > +      */
> > +     current->thread.fpu.state.xsave.header.xfeatures |= XFEATURE_MASK_PKRU;
>
> I don't think we need to describe how the init optimization works again.
> I'm also not sure it's worth mentioning context switches here.  It's a
> wider problem than that.  Maybe:
>
>         /*
>          * All fpregs will be XRSTOR'd from this buffer before returning
>          * to userspace.  Ensure that XRSTOR does not init PKRU and that
>          * get_xsave_addr() will work.
>          */
>
> > +     pk = get_xsave_addr(&current->thread.fpu.state.xsave, XFEATURE_PKRU);
> > +     BUG_ON(!pk);
>
> A BUG_ON() a line before a NULL pointer dereference doesn't tend to do
> much good.
>
> > +     pk->pkru = pkru;
> >       __write_pkru(pkru);
> >       fpregs_unlock();
> >  }
>
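
For reference, a small userspace model of the write_pkru() change
discussed above. It is a sketch under stated assumptions, not the
kernel's data structures: model_xsave, model_get_addr(), model_xrstor()
and both model_write_pkru_* helpers are hypothetical names. It assumes
a restore with the feature bit clear puts the component in its init
state (0 for PKRU), per the suggested comment about XRSTOR.

```c
/* Userspace model only: all names are hypothetical, not kernel API. */
#include <stddef.h>
#include <stdint.h>

#define MODEL_XFEATURE_MASK_PKRU (1ull << 9)

struct model_xsave {
	uint64_t xfeatures;	/* models xsave.header.xfeatures */
	uint32_t pkru_slot;	/* models the PKRU component in the buffer */
};

/* Stands in for the CPU's PKRU register. */
static uint32_t model_cpu_pkru;

/* Like get_xsave_addr(): NULL when the feature bit is clear. */
static uint32_t *model_get_addr(struct model_xsave *xs)
{
	if (!(xs->xfeatures & MODEL_XFEATURE_MASK_PKRU))
		return NULL;
	return &xs->pkru_slot;
}

/* Old behaviour: the buffer is only updated if the bit is already set. */
static void model_write_pkru_old(struct model_xsave *xs, uint32_t pkru)
{
	uint32_t *pk = model_get_addr(xs);

	if (pk)
		*pk = pkru;
	model_cpu_pkru = pkru;
}

/* Patched behaviour: set the bit first so the lookup cannot fail. */
static void model_write_pkru_new(struct model_xsave *xs, uint32_t pkru)
{
	xs->xfeatures |= MODEL_XFEATURE_MASK_PKRU;
	*model_get_addr(xs) = pkru;
	model_cpu_pkru = pkru;
}

/* Restore of this one component: a clear bit means init state (0). */
static void model_xrstor(const struct model_xsave *xs)
{
	if (xs->xfeatures & MODEL_XFEATURE_MASK_PKRU)
		model_cpu_pkru = xs->pkru_slot;
	else
		model_cpu_pkru = 0;
}
```

Starting from a buffer with the bit clear, the old helper followed by
model_xrstor() loses the written value, while the new helper keeps it.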

Brian