On Mon, Nov 4, 2019 at 4:31 AM Mark Rutland <mark.rutland@xxxxxxx> wrote:
> > +/*
> > + * In testing, 1 KiB shadow stack size (i.e. 128 stack frames on a 64-bit
> > + * architecture) provided ~40% safety margin on stack usage while keeping
> > + * memory allocation overhead reasonable.
> > + */
> > +#define SCS_SIZE 1024
>
> To make it easier to reason about type promotion rules (and avoid that
> we accidentally mask out high bits when using this to generate a mask),
> can we please make this 1024UL?

Sure.

> > --- a/kernel/sched/core.c
> > +++ b/kernel/sched/core.c
> > @@ -6013,6 +6013,8 @@ void init_idle(struct task_struct *idle, int cpu)
> > 	raw_spin_lock_irqsave(&idle->pi_lock, flags);
> > 	raw_spin_lock(&rq->lock);
> >
> > +	scs_task_reset(idle);
>
> Could we please do this next to the kasan_unpoison_task_stack() call,
> either just before, or just after?
>
> They're both addressing the same issue where previously live stack is
> being reused, and in general I'd expect them to occur at the same time
> (though I understand idle will be a bit different).

Good point, I'll move this.

> > --- a/kernel/sched/sched.h
> > +++ b/kernel/sched/sched.h
> > @@ -58,6 +58,7 @@
> > #include <linux/profile.h>
> > #include <linux/psi.h>
> > #include <linux/rcupdate_wait.h>
> > +#include <linux/scs.h>
> > #include <linux/security.h>
> > #include <linux/stop_machine.h>
> > #include <linux/suspend.h>
>
> This include looks extraneous.

I added this to sched.h, because most of the includes used in
kernel/sched appear to be there, but I can move this to
kernel/sched/core.c instead.

> > +static inline void *__scs_base(struct task_struct *tsk)
> > +{
> > +	/*
> > +	 * We allow architectures to use the shadow_call_stack field in
> > +	 * struct thread_info to store the current shadow stack pointer
> > +	 * during context switches.
> > +	 *
> > +	 * This allows the implementation to also clear the field when
> > +	 * the task is active to avoid keeping pointers to the current
> > +	 * task's shadow stack in memory. This can make it harder for an
> > +	 * attacker to locate the shadow stack, but also requires us to
> > +	 * compute the base address when needed.
> > +	 *
> > +	 * We assume the stack is aligned to SCS_SIZE.
> > +	 */
>
> How about:
>
> 	/*
> 	 * To minimize the risk of exposure, architectures may clear a
> 	 * task's thread_info::shadow_call_stack while that task is
> 	 * running, and only save/restore the active shadow call stack
> 	 * pointer when the usual register may be clobbered (e.g. across
> 	 * context switches).
> 	 *
> 	 * The shadow call stack is aligned to SCS_SIZE, and grows
> 	 * upwards, so we can mask out the low bits to extract the base
> 	 * when the task is not running.
> 	 */
>
> ... which I think makes the lifetime and constraints a bit clearer.

Sounds good to me, thanks.

> > +	return (void *)((uintptr_t)task_scs(tsk) & ~(SCS_SIZE - 1));
>
> We usually use unsigned long rather than uintptr_t. Could we please use
> that for consistency?
>
> The kernel relies on sizeof(unsigned long) == sizeof(void *) tree-wide,
> so that doesn't cause issues for us here.
>
> Similarly, as suggested above, it would be easier to reason about this
> knowing that SCS_SIZE is an unsigned long. While IIUC we'd get sign
> extension here when it's promoted, giving the definition a UL suffix
> minimizes the scope for error.

OK, I'll switch to unsigned long.

> > +/* Keep a cache of shadow stacks */
> > +#define SCS_CACHE_SIZE 2
>
> How about:
>
> /* Matches NR_CACHED_STACKS for VMAP_STACK */
> #define NR_CACHED_SCS 2
>
> ... which explains where the number came from, and avoids confusion that
> the SIZE is a byte size rather than number of elements.

Agreed, that sounds better.
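
For reference, the type promotion point can be checked with a small
userspace sketch. This is not the kernel code: the three macros and the
scs_base_*() helpers below are hypothetical names made up for
illustration, and the sketch assumes an LP64 target (64-bit unsigned
long). It applies the same mask expression with an unsigned long, a
plain int, and an unsigned int size.

```c
#include <assert.h>

/* Hypothetical stand-ins for illustration; not the kernel definitions. */
#define SCS_SIZE_UL	1024UL	/* unsigned long, as requested in review */
#define SCS_SIZE_INT	1024	/* plain int, as in the posted patch */
#define SCS_SIZE_U	1024U	/* unsigned int, counterexample only */

/* Assumption of this sketch: LP64, i.e. unsigned long is 64 bits wide. */
static_assert(sizeof(unsigned long) == 8, "sketch assumes LP64");

/* Mask computed in unsigned long: only the low 10 bits are cleared. */
static unsigned long scs_base_ul(unsigned long sp)
{
	return sp & ~(SCS_SIZE_UL - 1);
}

/*
 * Mask computed as int (-1024), then sign-extended when converted to
 * unsigned long, so the high bits of the address survive.
 */
static unsigned long scs_base_int(unsigned long sp)
{
	return sp & ~(SCS_SIZE_INT - 1);
}

/*
 * Mask computed as unsigned int (0xfffffc00), then zero-extended, so
 * the upper 32 bits of the address are cleared as well.
 */
static unsigned long scs_base_u(unsigned long sp)
{
	return sp & ~(SCS_SIZE_U - 1);
}
```

Under these assumptions the int and unsigned long variants agree, and
only the unsigned int variant drops the high bits. The UL suffix removes
the need to reason about which conversion applies.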
> > +static void scs_free(void *s)
> > +{
> > +	int i;
> > +
> > +	for (i = 0; i < SCS_CACHE_SIZE; i++)
> > +		if (this_cpu_cmpxchg(scs_cache[i], 0, s) == 0)
> > +			return;
>
> Here we should compare to NULL rather than 0.

Ack.

> > +void __init scs_init(void)
> > +{
> > +	cpuhp_setup_state(CPUHP_BP_PREPARE_DYN, "scs:scs_cache", NULL,
> > +		scs_cleanup);
>
> We probably want to do something if this call fails. It looks like we'd
> only leak two pages (and we'd be able to use them if/when that CPU is
> brought back online). A WARN_ON() is probably fine.

fork_init() in kernel/fork.c lets this fail quietly, but adding a
WARN_ON seems fine.

I will include these changes in v5.

Sami
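
For anyone following along, the slot-caching pattern in scs_free() can
be sketched in userspace roughly as follows. This is only an
illustration under stated assumptions: it uses a single global array
with C11 atomics in place of the kernel's per-CPU variables and
this_cpu_cmpxchg(), and scs_cache_put()/scs_cache_get() are
hypothetical helper names, not functions from the patch. The comparison
is against NULL, as requested in the review.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Number of cached entries; name taken from the review suggestion. */
#define NR_CACHED_SCS 2

/* Global stand-in for the per-CPU cache used in the kernel patch. */
static _Atomic(void *) scs_cache[NR_CACHED_SCS];

/*
 * Try to park s in the first empty slot.
 * Returns 1 if s was cached, 0 if every slot is occupied and the
 * caller has to release s some other way.
 */
static int scs_cache_put(void *s)
{
	assert(s != NULL);	/* NULL marks an empty slot */

	for (int i = 0; i < NR_CACHED_SCS; i++) {
		void *expected = NULL;

		/* Succeeds only if the slot still holds NULL. */
		if (atomic_compare_exchange_strong(&scs_cache[i], &expected, s))
			return 1;
	}
	return 0;
}

/* Take the first cached entry, or return NULL if the cache is empty. */
static void *scs_cache_get(void)
{
	for (int i = 0; i < NR_CACHED_SCS; i++) {
		void *s = atomic_exchange(&scs_cache[i], NULL);

		if (s)
			return s;
	}
	return NULL;
}
```

With two slots, a third scs_cache_put() on a full cache returns 0,
which corresponds to the path where scs_free() falls through the loop
and has to free the stack itself.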