On Wed, Mar 13, 2024 at 6:23 AM Thomas Gleixner <tglx@xxxxxxxxxxxxx> wrote:
>
> On Mon, Mar 11 2024 at 16:46, Pasha Tatashin wrote:
> > @@ -413,6 +413,9 @@ DEFINE_IDTENTRY_DF(exc_double_fault)
> >         }
> >  #endif
> >
> > +       if (dynamic_stack_fault(current, address))
> > +               return;
> > +
> >         irqentry_nmi_enter(regs);
> >         instrumentation_begin();
> >         notify_die(DIE_TRAP, str, regs, error_code, X86_TRAP_DF, SIGSEGV);
> > diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
> > index d6375b3c633b..651c558b10eb 100644
> > --- a/arch/x86/mm/fault.c
> > +++ b/arch/x86/mm/fault.c
> > @@ -1198,6 +1198,9 @@ do_kern_addr_fault(struct pt_regs *regs, unsigned long hw_error_code,
> >         if (is_f00f_bug(regs, hw_error_code, address))
> >                 return;
> >
> > +       if (dynamic_stack_fault(current, address))
> > +               return;
>
> T1 schedules out with stack used close to the fault boundary.
>
> switch_to(T2)
>
> Now T1 schedules back in
>
> switch_to(T1)
>   __switch_to_asm()
>     ...
>     switch_stacks()          <- SP on T1 stack
>    ! ...
>    ! jmp __switch_to()
>    ! __switch_to()
>    !   ...
>    !   raw_cpu_write(pcpu_hot.current_task, next_p);
>
> After switching SP to T1's stack and up to the point where
> pcpu_hot.current_task (aka current) is updated to T1 a stack fault will
> invoke dynamic_stack_fault(T2, address) which will return false here:
>
>         /* check if address is inside the kernel stack area */
>         stack = (unsigned long)tsk->stack;
>         if (address < stack || address >= stack + THREAD_SIZE)
>                 return false;
>
> because T2's stack does obviously not cover the faulting address on T1's
> stack. As a consequence double fault will panic the machine.

Hi Thomas,

Thank you, you are absolutely right: we cannot trust "current" in the
fault handler. Let's modify dynamic_stack_fault() to accept only the
fault_address; it can then determine the correct task_struct pointer
internally.

Here's a potential solution that is fast, avoids locking, and ensures
atomicity:

1. Kernel Stack VA Space
   Dedicate a virtual address range, [KSTACK_START_VA, KSTACK_END_VA],
   exclusively to kernel stacks. Validating that a faulting address
   belongs to a stack then becomes a simple range check.

2. Finding the faulting task
   - Use ALIGN(fault_address, THREAD_SIZE) to compute the end of the
     topmost stack page, since stack addresses are aligned to
     THREAD_SIZE.
   - Store the task_struct pointer as the last word on this topmost
     page, which is always present because it is a pre-allocated stack
     page (see the sketch at the end of this mail).

3. Stack Padding
   Increase the padding on x86_64 to 8 bytes
   (TOP_OF_KERNEL_STACK_PADDING 8) to accommodate the task_struct
   pointer.

Another issue this race brings up is that 3 pages per-CPU might not be
enough; we might need up to 6 pages: 3 to cover the going-away task,
and 3 to cover the new task.
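Putting 1-3 together, dynamic_stack_fault() could look roughly like the
sketch below. This is only an illustration of the scheme, not tested
code: KSTACK_START_VA/KSTACK_END_VA and dynamic_stack_map_page() are
placeholder names for the dedicated stack VA range and for the path
that maps a new page into the faulting stack.

bool dynamic_stack_fault(unsigned long fault_address)
{
        unsigned long stack_top;
        struct task_struct *tsk;

        /* Only addresses inside the dedicated stack VA range qualify. */
        if (fault_address < KSTACK_START_VA || fault_address >= KSTACK_END_VA)
                return false;

        /*
         * Stacks are THREAD_SIZE aligned, so rounding the faulting
         * address up gives the top of the stack it belongs to; the +1
         * keeps an exactly aligned address (the lowest byte of a stack)
         * from rounding to its own base.
         */
        stack_top = ALIGN(fault_address + 1, THREAD_SIZE);

        /*
         * The owning task's task_struct pointer sits in the last word
         * of the topmost, always-present stack page, carved out of
         * TOP_OF_KERNEL_STACK_PADDING.
         */
        tsk = *(struct task_struct **)(stack_top - sizeof(void *));

        return dynamic_stack_map_page(tsk, fault_address);
}

This never reads "current", so it should be safe in the
__switch_to_asm() window you describe: the task pointer comes from the
stack that actually faulted, regardless of what pcpu_hot.current_task
says at that point.

Pasha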