<vedvyas.shanbhogue@xxxxxxxxx>,Dave Martin <Dave.Martin@xxxxxxx>,Weijiang Yang <weijiang.yang@xxxxxxxxx>,Pengfei Xu <pengfei.xu@xxxxxxxxx>,Haitao Huang <haitao.huang@xxxxxxxxx> From: "H. Peter Anvin" <hpa@xxxxxxxxx> Message-ID: <2360408E-9967-469C-96AF-E8AC40044021@xxxxxxxxx> There are a few words at the end of the FXSAVE area for software use for this reason: so we can extend the signal frame – originally to enable saving the XSAVE state but that doesn't use up the whole software available area. On May 2, 2021 4:23:49 PM PDT, Andy Lutomirski <luto@xxxxxxxxxx> wrote: >On Fri, Apr 30, 2021 at 10:47 AM Andy Lutomirski <luto@xxxxxxxxxx> >wrote: >> >> On Fri, Apr 30, 2021 at 10:00 AM Yu, Yu-cheng <yu-cheng.yu@xxxxxxxxx> >wrote: >> > >> > On 4/28/2021 4:03 PM, Andy Lutomirski wrote: >> > > On Tue, Apr 27, 2021 at 1:44 PM Yu-cheng Yu ><yu-cheng.yu@xxxxxxxxx> wrote: >> > >> >> > >> When shadow stack is enabled, a task's shadow stack states must >be saved >> > >> along with the signal context and later restored in sigreturn. >However, >> > >> currently there is no systematic facility for extending a signal >context. >> > >> There is some space left in the ucontext, but changing ucontext >is likely >> > >> to create compatibility issues and there is not enough space for >further >> > >> extensions. >> > >> >> > >> Introduce a signal context extension struct 'sc_ext', which is >used to save >> > >> shadow stack restore token address. The extension is located >above the fpu >> > >> states, plus alignment. The struct can be extended (such as the >ibt's >> > >> wait_endbr status to be introduced later), and sc_ext.total_size >field >> > >> keeps track of total size. >> > > >> > > I still don't like this. >> > > >> > > Here's how the signal layout works, for better or for worse: >> > > >> > > The kernel has: >> > > >> > > struct rt_sigframe { >> > > char __user *pretcode; >> > > struct ucontext uc; >> > > struct siginfo info; >> > > /* fp state follows here */ >> > > }; >> > > >> > > This is roughly the actual signal frame. But userspace does not >have >> > > this struct declared, and user code does not know the sizes of >the >> > > fields. So it's accessed in a nonsensical way. The signal >handler >> > > function is passed a pointer to the whole sigframe implicitly in >RSP, >> > > a pointer to &frame->info in RSI, anda pointer to &frame->uc in >RDX. >> > > User code can *find* the fp state by following a pointer from >> > > mcontext, which is, in turn, found via uc: >> > > >> > > struct ucontext { >> > > unsigned long uc_flags; >> > > struct ucontext *uc_link; >> > > stack_t uc_stack; >> > > struct sigcontext uc_mcontext; <-- fp pointer is in here >> > > sigset_t uc_sigmask; /* mask last for extensibility >*/ >> > > }; >> > > >> > > The kernel, in sigreturn, works a bit differently. The sigreturn >> > > variants know the base address of the frame but don't have the >benefit >> > > of receiving pointers to the fields. So instead the kernel takes >> > > advantage of the fact that it knows the offset to uc and parses >uc >> > > accordingly. And the kernel follows the pointer in mcontext to >find >> > > the fp state. The latter bit is quite important later. The >kernel >> > > does not parse info at all. >> > > >> > > The fp state is its own mess. When XSAVE happened, Intel kindly >(?) >> > > gave us a software defined area between the "legacy" x87 region >and >> > > the modern supposedly extensible part. Linux sticks the >following >> > > structure in that hole: >> > > >> > > struct _fpx_sw_bytes { >> > > /* >> > > * If set to FP_XSTATE_MAGIC1 then this is an xstate >context. >> > > * 0 if a legacy frame. >> > > */ >> > > __u32 magic1; >> > > >> > > /* >> > > * Total size of the fpstate area: >> > > * >> > > * - if magic1 == 0 then it's sizeof(struct _fpstate) >> > > * - if magic1 == FP_XSTATE_MAGIC1 then it's sizeof(struct >_xstate) >> > > * plus extensions (if any) >> > > */ >> > > __u32 extended_size; >> > > >> > > /* >> > > * Feature bit mask (including FP/SSE/extended state) that >is present >> > > * in the memory layout: >> > > */ >> > > __u64 xfeatures; >> > > >> > > /* >> > > * Actual XSAVE state size, based on the xfeatures saved in >the layout. >> > > * 'extended_size' is greater than 'xstate_size': >> > > */ >> > > __u32 xstate_size; >> > > >> > > /* For future use: */ >> > > __u32 padding[7]; >> > > }; >> > > >> > > >> > > That's where we are right now upstream. The kernel has a parser >for >> > > the FPU state that is bugs piled upon bugs and is going to have >to be >> > > rewritten sometime soon. On top of all this, we have two >upcoming >> > > features, both of which require different kinds of extensions: >> > > >> > > 1. AVX-512. (Yeah, you thought this story was over a few years >ago, >> > > but no. And AMX makes it worse.) To make a long story short, we >> > > promised user code many years ago that a signal frame fit in 2048 >> > > bytes with some room to spare. With AVX-512 this is false. With >AMX >> > > it's so wrong it's not even funny. The only way out of the mess >> > > anyone has come up with involves making the length of the FPU >state >> > > vary depending on which features are INIT, i.e. making it more >compact >> > > than "compact" mode is. This has a side effect: it's no longer >> > > possible to modify the state in place, because enabling a feature >with >> > > no space allocated will make the structure bigger, and the stack >won't >> > > have room. Fortunately, one can relocate the entire FPU state, >update >> > > the pointer in mcontext, and the kernel will happily follow the >> > > pointer. So new code on a new kernel using a super-compact state >> > > could expand the state by allocating new memory (on the heap? >very >> > > awkwardly on the stack?) and changing the pointer. For all we >know, >> > > some code already fiddles with the pointer. This is great, >except >> > > that your patch sticks more data at the end of the FPU block that >no >> > > one is expecting, and your sigreturn code follows that pointer, >and >> > > will read off into lala land. >> > > >> > >> > Then, what about we don't do that at all. Is it possible from now >on we >> > don't stick more data at the end, and take the relocating-fpu >approach? >> > >> > > 2. CET. CET wants us to find a few more bytes somewhere, and >those >> > > bytes logically belong in ucontext, and here we are. >> > > >> > >> > Fortunately, we can spare CET the need of ucontext extension. When >the >> > kernel handles sigreturn, the user-mode shadow stack pointer is >right at >> > the restore token. There is no need to put that in ucontext. >> >> That seems entirely reasonable. This might also avoid needing to >> teach CRIU about CET at all. > >Wait, what's the actual shadow stack token format? And is the token >on the new stack or the old stack when sigaltstack is in use? For >that matter, is there any support for an alternate shadow stack for >signals? > >--Andy -- Sent from my Android device with K-9 Mail. Please excuse my brevity.