Re: extending ucontext (Re: [PATCH v26 25/30] x86/cet/shstk: Handle signals for shadow stack)

"H. Peter Anvin" <hpa@xxxxxxxxx> · Sun, 02 May 2021 23:03:06 -0700

<vedvyas.shanbhogue@xxxxxxxxx>,Dave Martin <Dave.Martin@xxxxxxx>,Weijiang Yang <weijiang.yang@xxxxxxxxx>,Pengfei Xu <pengfei.xu@xxxxxxxxx>,Haitao Huang <haitao.huang@xxxxxxxxx>
From: "H. Peter Anvin" <hpa@xxxxxxxxx>
Message-ID: <2360408E-9967-469C-96AF-E8AC40044021@xxxxxxxxx>

There are a few words at the end of the FXSAVE area for software use for this reason: so we can extend the signal frame – originally to enable saving the XSAVE state but that doesn't use up the whole software available area.

On May 2, 2021 4:23:49 PM PDT, Andy Lutomirski <luto@xxxxxxxxxx> wrote:
>On Fri, Apr 30, 2021 at 10:47 AM Andy Lutomirski <luto@xxxxxxxxxx>
>wrote:
>>
>> On Fri, Apr 30, 2021 at 10:00 AM Yu, Yu-cheng <yu-cheng.yu@xxxxxxxxx>
>wrote:
>> >
>> > On 4/28/2021 4:03 PM, Andy Lutomirski wrote:
>> > > On Tue, Apr 27, 2021 at 1:44 PM Yu-cheng Yu
><yu-cheng.yu@xxxxxxxxx> wrote:
>> > >>
>> > >> When shadow stack is enabled, a task's shadow stack states must
>be saved
>> > >> along with the signal context and later restored in sigreturn. 
>However,
>> > >> currently there is no systematic facility for extending a signal
>context.
>> > >> There is some space left in the ucontext, but changing ucontext
>is likely
>> > >> to create compatibility issues and there is not enough space for
>further
>> > >> extensions.
>> > >>
>> > >> Introduce a signal context extension struct 'sc_ext', which is
>used to save
>> > >> shadow stack restore token address.  The extension is located
>above the fpu
>> > >> states, plus alignment.  The struct can be extended (such as the
>ibt's
>> > >> wait_endbr status to be introduced later), and sc_ext.total_size
>field
>> > >> keeps track of total size.
>> > >
>> > > I still don't like this.
>> > >
>> > > Here's how the signal layout works, for better or for worse:
>> > >
>> > > The kernel has:
>> > >
>> > > struct rt_sigframe {
>> > >      char __user *pretcode;
>> > >      struct ucontext uc;
>> > >      struct siginfo info;
>> > >      /* fp state follows here */
>> > > };
>> > >
>> > > This is roughly the actual signal frame.  But userspace does not
>have
>> > > this struct declared, and user code does not know the sizes of
>the
>> > > fields.  So it's accessed in a nonsensical way.  The signal
>handler
>> > > function is passed a pointer to the whole sigframe implicitly in
>RSP,
>> > > a pointer to &frame->info in RSI, anda pointer to &frame->uc in
>RDX.
>> > > User code can *find* the fp state by following a pointer from
>> > > mcontext, which is, in turn, found via uc:
>> > >
>> > > struct ucontext {
>> > >      unsigned long      uc_flags;
>> > >      struct ucontext  *uc_link;
>> > >      stack_t          uc_stack;
>> > >      struct sigcontext uc_mcontext;  <-- fp pointer is in here
>> > >      sigset_t      uc_sigmask;    /* mask last for extensibility
>*/
>> > > };
>> > >
>> > > The kernel, in sigreturn, works a bit differently.  The sigreturn
>> > > variants know the base address of the frame but don't have the
>benefit
>> > > of receiving pointers to the fields.  So instead the kernel takes
>> > > advantage of the fact that it knows the offset to uc and parses
>uc
>> > > accordingly.  And the kernel follows the pointer in mcontext to
>find
>> > > the fp state.  The latter bit is quite important later.  The
>kernel
>> > > does not parse info at all.
>> > >
>> > > The fp state is its own mess.  When XSAVE happened, Intel kindly
>(?)
>> > > gave us a software defined area between the "legacy" x87 region
>and
>> > > the modern supposedly extensible part.  Linux sticks the
>following
>> > > structure in that hole:
>> > >
>> > > struct _fpx_sw_bytes {
>> > >      /*
>> > >       * If set to FP_XSTATE_MAGIC1 then this is an xstate
>context.
>> > >       * 0 if a legacy frame.
>> > >       */
>> > >      __u32                magic1;
>> > >
>> > >      /*
>> > >       * Total size of the fpstate area:
>> > >       *
>> > >       *  - if magic1 == 0 then it's sizeof(struct _fpstate)
>> > >       *  - if magic1 == FP_XSTATE_MAGIC1 then it's sizeof(struct
>_xstate)
>> > >       *    plus extensions (if any)
>> > >       */
>> > >      __u32                extended_size;
>> > >
>> > >      /*
>> > >       * Feature bit mask (including FP/SSE/extended state) that
>is present
>> > >       * in the memory layout:
>> > >       */
>> > >      __u64                xfeatures;
>> > >
>> > >      /*
>> > >       * Actual XSAVE state size, based on the xfeatures saved in
>the layout.
>> > >       * 'extended_size' is greater than 'xstate_size':
>> > >       */
>> > >      __u32                xstate_size;
>> > >
>> > >      /* For future use: */
>> > >      __u32                padding[7];
>> > > };
>> > >
>> > >
>> > > That's where we are right now upstream.  The kernel has a parser
>for
>> > > the FPU state that is bugs piled upon bugs and is going to have
>to be
>> > > rewritten sometime soon.  On top of all this, we have two
>upcoming
>> > > features, both of which require different kinds of extensions:
>> > >
>> > > 1. AVX-512.  (Yeah, you thought this story was over a few years
>ago,
>> > > but no.  And AMX makes it worse.)  To make a long story short, we
>> > > promised user code many years ago that a signal frame fit in 2048
>> > > bytes with some room to spare.  With AVX-512 this is false.  With
>AMX
>> > > it's so wrong it's not even funny.  The only way out of the mess
>> > > anyone has come up with involves making the length of the FPU
>state
>> > > vary depending on which features are INIT, i.e. making it more
>compact
>> > > than "compact" mode is.  This has a side effect: it's no longer
>> > > possible to modify the state in place, because enabling a feature
>with
>> > > no space allocated will make the structure bigger, and the stack
>won't
>> > > have room.  Fortunately, one can relocate the entire FPU state,
>update
>> > > the pointer in mcontext, and the kernel will happily follow the
>> > > pointer.  So new code on a new kernel using a super-compact state
>> > > could expand the state by allocating new memory (on the heap?
>very
>> > > awkwardly on the stack?) and changing the pointer.  For all we
>know,
>> > > some code already fiddles with the pointer.  This is great,
>except
>> > > that your patch sticks more data at the end of the FPU block that
>no
>> > > one is expecting, and your sigreturn code follows that pointer,
>and
>> > > will read off into lala land.
>> > >
>> >
>> > Then, what about we don't do that at all.  Is it possible from now
>on we
>> > don't stick more data at the end, and take the relocating-fpu
>approach?
>> >
>> > > 2. CET.  CET wants us to find a few more bytes somewhere, and
>those
>> > > bytes logically belong in ucontext, and here we are.
>> > >
>> >
>> > Fortunately, we can spare CET the need of ucontext extension.  When
>the
>> > kernel handles sigreturn, the user-mode shadow stack pointer is
>right at
>> > the restore token.  There is no need to put that in ucontext.
>>
>> That seems entirely reasonable.  This might also avoid needing to
>> teach CRIU about CET at all.
>
>Wait, what's the actual shadow stack token format?  And is the token
>on the new stack or the old stack when sigaltstack is in use?  For
>that matter, is there any support for an alternate shadow stack for
>signals?
>
>--Andy

-- 
Sent from my Android device with K-9 Mail. Please excuse my brevity.