On Wed, 2023-03-08 at 16:26 +0100, Borislav Petkov wrote:
> On Mon, Feb 27, 2023 at 02:29:46PM -0800, Rick Edgecombe wrote:
> > From: Yu-cheng Yu <yu-cheng.yu@xxxxxxxxx>
> >
> > When a process is duplicated, but the child shares the address space with
> > the parent, there is potential for the threads sharing a single stack to
> > cause conflicts for each other. In the normal non-cet case this is handled
>
> "non-CET"

Sure.

> > in two ways.
> >
> > With regular CLONE_VM a new stack is provided by userspace such that the
> > parent and child have different stacks.
> >
> > For vfork, the parent is suspended until the child exits. So as long as
> > the child doesn't return from the vfork()/CLONE_VFORK calling function and
> > sticks to a limited set of operations, the parent and child can share the
> > same stack.
> >
> > For shadow stack, these scenarios present similar sharing problems. For the
> > CLONE_VM case, the child and the parent must have separate shadow stacks.
> > Instead of changing clone to take a shadow stack, have the kernel just
> > allocate one and switch to it.
> >
> > Use stack_size passed from clone3() syscall for thread shadow stack size. A
> > compat-mode thread shadow stack size is further reduced to 1/4. This
> > allows more threads to run in a 32-bit address space. The clone() does not
> > pass stack_size, which was added to clone3(). In that case, use
> > RLIMIT_STACK size and cap to 4 GB.
> >
> > For shadow stack enabled vfork(), the parent and child can share the same
> > shadow stack, like they can share a normal stack. Since the parent is
> > suspended until the child terminates, the child will not interfere with
> > the parent while executing as long as it doesn't return from the vfork()
> > and overwrite up the shadow stack. The child can safely overwrite down
> > the shadow stack, as the parent can just overwrite this later. So CET does
> > not add any additional limitations for vfork().
> >
> > Userspace implementing posix vfork() can actually prevent the child from
>
> "POSIX"

Ok.

> ...
>
> > diff --git a/arch/x86/kernel/fpu/core.c b/arch/x86/kernel/fpu/core.c
> > index f851558b673f..bc3de4aeb661 100644
> > --- a/arch/x86/kernel/fpu/core.c
> > +++ b/arch/x86/kernel/fpu/core.c
> > @@ -552,8 +552,41 @@ static inline void fpu_inherit_perms(struct fpu *dst_fpu)
> >  	}
> >  }
> >
> > +#ifdef CONFIG_X86_USER_SHADOW_STACK
> > +static int update_fpu_shstk(struct task_struct *dst, unsigned long ssp)
> > +{
> > +	struct cet_user_state *xstate;
> > +
> > +	/* If ssp update is not needed. */
> > +	if (!ssp)
> > +		return 0;
> > +
> > +	xstate = get_xsave_addr(&dst->thread.fpu.fpstate->regs.xsave,
> > +				XFEATURE_CET_USER);
> > +
> > +	/*
> > +	 * If there is a non-zero ssp, then 'dst' must be configured with a shadow
> > +	 * stack and the fpu state should be up to date since it was just copied
> > +	 * from the parent in fpu_clone(). So there must be a valid non-init CET
> > +	 * state location in the buffer.
> > +	 */
> > +	if (WARN_ON_ONCE(!xstate))
> > +		return 1;
> > +
> > +	xstate->user_ssp = (u64)ssp;
> > +
> > +	return 0;
> > +}
> > +#else
> > +static int update_fpu_shstk(struct task_struct *dst, unsigned long shstk_addr)
>                                                                       ^^^^^^^^^^
>
> ssp, like above.
>
> Better yet:
>
> static int update_fpu_shstk(struct task_struct *dst, unsigned long ssp)
> {
> #ifdef CONFIG_X86_USER_SHADOW_STACK
> 	...
> #endif
> 	return 0;
> }
>
> and less ifdeffery.

Sure.
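So something like this, I guess (untested, just the hunk above folded into
your single-function form):

static int update_fpu_shstk(struct task_struct *dst, unsigned long ssp)
{
#ifdef CONFIG_X86_USER_SHADOW_STACK
	struct cet_user_state *xstate;

	/* If ssp update is not needed. */
	if (!ssp)
		return 0;

	xstate = get_xsave_addr(&dst->thread.fpu.fpstate->regs.xsave,
				XFEATURE_CET_USER);

	/*
	 * A non-zero ssp means 'dst' was configured with a shadow stack and
	 * the fpu state was just copied from the parent in fpu_clone(), so
	 * there must be a valid non-init CET state location in the buffer.
	 */
	if (WARN_ON_ONCE(!xstate))
		return 1;

	xstate->user_ssp = (u64)ssp;
#endif
	return 0;
}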
Sometimes people tell me to only ifdef out whole functions to make it
easier to read. I suppose in this case it's not hard to see.

> > +{
> > +	return 0;
> > +}
> > +#endif
> > +
> >  /* Clone current's FPU state on fork */
> > -int fpu_clone(struct task_struct *dst, unsigned long clone_flags, bool minimal)
> > +int fpu_clone(struct task_struct *dst, unsigned long clone_flags, bool minimal,
> > +	      unsigned long ssp)
> >  {
> >  	struct fpu *src_fpu = &current->thread.fpu;
> >  	struct fpu *dst_fpu = &dst->thread.fpu;
> > @@ -613,6 +646,12 @@ int fpu_clone(struct task_struct *dst, unsigned long clone_flags, bool minimal)
> >  	if (use_xsave())
> >  		dst_fpu->fpstate->regs.xsave.header.xfeatures &= ~XFEATURE_MASK_PASID;
> >
> > +	/*
> > +	 * Update shadow stack pointer, in case it changed during clone.
> > +	 */
> > +	if (update_fpu_shstk(dst, ssp))
> > +		return 1;
> > +
> >  	trace_x86_fpu_copy_src(src_fpu);
> >  	trace_x86_fpu_copy_dst(dst_fpu);
> >
> > diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c
> > index b650cde3f64d..bf703f53fa49 100644
> > --- a/arch/x86/kernel/process.c
> > +++ b/arch/x86/kernel/process.c
> > @@ -48,6 +48,7 @@
> >  #include <asm/frame.h>
> >  #include <asm/unwind.h>
> >  #include <asm/tdx.h>
> > +#include <asm/shstk.h>
> >
> >  #include "process.h"
> >
> > @@ -119,6 +120,7 @@ void exit_thread(struct task_struct *tsk)
> >
> >  	free_vm86(t);
> >
> > +	shstk_free(tsk);
> >  	fpu__drop(fpu);
> >  }
> >
> > @@ -140,6 +142,7 @@ int copy_thread(struct task_struct *p, const struct kernel_clone_args *args)
> >  	struct inactive_task_frame *frame;
> >  	struct fork_frame *fork_frame;
> >  	struct pt_regs *childregs;
> > +	unsigned long shstk_addr = 0;
> >  	int ret = 0;
> >
> >  	childregs = task_pt_regs(p);
> > @@ -174,7 +177,13 @@ int copy_thread(struct task_struct *p, const struct kernel_clone_args *args)
> >  	frame->flags = X86_EFLAGS_FIXED;
> >  #endif
> >
> > -	fpu_clone(p, clone_flags, args->fn);
> > +	/* Allocate a new shadow stack for pthread if needed */
> > +	ret = shstk_alloc_thread_stack(p, clone_flags, args->stack_size,
> > +				       &shstk_addr);
>
> That function will return 0 even if shstk_addr hasn't been written in it
> and you will continue merrily and call
>
> 	fpu_clone(..., shstk_addr=0);
>
> why don't you return the shadow stack address or negative on error
> instead of adding an I/O parameter which is pretty much always nasty to
> deal with.

On a shadow stack allocation error, we fail the copy_thread(). When
shadow stack is enabled, the app might be able to handle a clone
failure, but would not be able to handle starting a new thread without
getting a new shadow stack.

So in your suggestion I guess we would have two types of failure: one
that signifies shadow stack is enabled and the allocation failed, and
another that signifies that shadow stack is not enabled, so zero needs
to be passed into fpu_clone()?
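If I'm reading the return-the-address idea right, the call site would look
something like this (sketch only, not what the patch does, and the
IS_ERR_VALUE() use is just to illustrate the error encoding):

	shstk_addr = shstk_alloc_thread_stack(p, clone_flags, args->stack_size);
	if (IS_ERR_VALUE(shstk_addr))
		return (int)shstk_addr;

	/* shstk_addr == 0 would have to mean "shadow stack not enabled" */
	fpu_clone(p, clone_flags, args->fn, shstk_addr);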
We need the output param in shstk_alloc_thread_stack() because we need
to update the SSP to the new shadow stack. If we want to make the
non-shadow stack case handled differently, I think the extra
conditionals are worse, like:

	/* Allocate a new shadow stack for pthread if needed */
	ret = shstk_alloc_thread_stack(p, clone_flags, args->stack_size,
				       &shstk_addr);
	if (ret == -EOPNOTSUPP)
		fpu_clone(p, clone_flags, args->fn, 0);
	else if (ret < 0)
		return ret;
	else
		fpu_clone(p, clone_flags, args->fn, shstk_addr);

What do you think?

It used to be that shstk_alloc_thread_stack() reached into FPU internals
to do the SSP update itself. Then the ability to do this was removed. So
I came up with an interface for allowing features to modify XSAVE buffers
from outside the FPU code. On further discussion, letting code outside
the FPU have flexible access to the XSAVE buffer could constrain the FPU
code's ability to add optimizations. So Thomas suggested passing the SSP
along into the FPU code so that the FPU modification could stay
monolithic and flexible.

If the default SSP value logic is too hidden, what about some clearer
code and comments, like this?

diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c
index bf703f53fa49..bd123527fcca 100644
--- a/arch/x86/kernel/process.c
+++ b/arch/x86/kernel/process.c
@@ -142,7 +142,7 @@ int copy_thread(struct task_struct *p, const struct kernel_clone_args *args)
 	struct inactive_task_frame *frame;
 	struct fork_frame *fork_frame;
 	struct pt_regs *childregs;
-	unsigned long shstk_addr = 0;
+	unsigned long new_ssp;
 	int ret = 0;
 
 	childregs = task_pt_regs(p);
@@ -177,13 +177,18 @@ int copy_thread(struct task_struct *p, const struct kernel_clone_args *args)
 	frame->flags = X86_EFLAGS_FIXED;
 #endif
 
-	/* Allocate a new shadow stack for pthread if needed */
+	/*
+	 * Allocate a new shadow stack for thread if needed. If shadow stack
+	 * is disabled, new_ssp will remain 0, and fpu_clone() will know not to
+	 * update it.
+	 */
+	new_ssp = 0;
 	ret = shstk_alloc_thread_stack(p, clone_flags, args->stack_size,
-				       &shstk_addr);
+				       &new_ssp);
 	if (ret)
 		return ret;
 
-	fpu_clone(p, clone_flags, args->fn, shstk_addr);
+	fpu_clone(p, clone_flags, args->fn, new_ssp);
 
 	/* Kernel thread ? */
 	if (unlikely(p->flags & PF_KTHREAD)) {