On Wed, Mar 13, 2024 at 6:23 AM Thomas Gleixner <tglx@xxxxxxxxxxxxx> wrote:
>
> On Mon, Mar 11 2024 at 16:46, Pasha Tatashin wrote:
> > @@ -413,6 +413,9 @@ DEFINE_IDTENTRY_DF(exc_double_fault)
> >         }
> >  #endif
> >
> > +       if (dynamic_stack_fault(current, address))
> > +               return;
> > +
> >         irqentry_nmi_enter(regs);
> >         instrumentation_begin();
> >         notify_die(DIE_TRAP, str, regs, error_code, X86_TRAP_DF, SIGSEGV);
> > diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
> > index d6375b3c633b..651c558b10eb 100644
> > --- a/arch/x86/mm/fault.c
> > +++ b/arch/x86/mm/fault.c
> > @@ -1198,6 +1198,9 @@ do_kern_addr_fault(struct pt_regs *regs, unsigned long hw_error_code,
> >         if (is_f00f_bug(regs, hw_error_code, address))
> >                 return;
> >
> > +       if (dynamic_stack_fault(current, address))
> > +               return;
>
> T1 schedules out with stack used close to the fault boundary.
>
> switch_to(T2)
>
> Now T1 schedules back in
>
> switch_to(T1)
>   __switch_to_asm()
>     ...
>     switch_stacks()          <- SP on T1 stack
>    ! ...
>    ! jmp __switch_to()
>    ! __switch_to()
>    !   ...
>    !   raw_cpu_write(pcpu_hot.current_task, next_p);
>
> After switching SP to T1's stack and up to the point where
> pcpu_hot.current_task (aka current) is updated to T1 a stack fault will
> invoke dynamic_stack_fault(T2, address) which will return false here:
>
>         /* check if address is inside the kernel stack area */
>         stack = (unsigned long)tsk->stack;
>         if (address < stack || address >= stack + THREAD_SIZE)
>                 return false;
>
> because T2's stack does obviously not cover the faulting address on T1's
> stack. As a consequence double fault will panic the machine.

Hi Thomas,

Thank you, you are absolutely right: we cannot trust "current" in the
fault handler. Let's modify dynamic_stack_fault() to accept only the
fault_address; it can then determine the correct task_struct pointer
internally.

Here's a potential solution that is fast, avoids locking, and ensures
atomicity:

1. Kernel Stack VA Space
   Dedicate a virtual address range, [KSTACK_START_VA, KSTACK_END_VA],
   exclusively to kernel stacks. Validating that a faulting address
   belongs to a stack then becomes a simple range check.

2. Finding the faulting task
   - Use ALIGN(fault_address, THREAD_SIZE) to compute the end of the
     topmost stack page, since stack addresses are aligned to
     THREAD_SIZE.
   - Store the task_struct pointer as the last word on this topmost
     page, which is always present because it is a pre-allocated stack
     page (see the sketch at the end of this mail).

3. Stack Padding
   Increase the padding on x86_64 to 8 bytes
   (TOP_OF_KERNEL_STACK_PADDING 8) to accommodate the task_struct
   pointer.

Another issue this race brings up is that 3 pages per-CPU might not be
enough; we might need up to 6 pages: 3 to cover the going-away task,
and 3 to cover the new task.
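Putting 1-3 together, dynamic_stack_fault() could look roughly like the
sketch below. This is only an illustration of the scheme, not tested
code: KSTACK_START_VA/KSTACK_END_VA and dynamic_stack_map_page() are
placeholder names for the dedicated stack VA range and for the path
that maps a new page into the faulting stack.

bool dynamic_stack_fault(unsigned long fault_address)
{
        unsigned long stack_top;
        struct task_struct *tsk;

        /* Only addresses inside the dedicated stack VA range qualify. */
        if (fault_address < KSTACK_START_VA || fault_address >= KSTACK_END_VA)
                return false;

        /*
         * Stacks are THREAD_SIZE aligned, so rounding the faulting
         * address up gives the top of the stack it belongs to; the +1
         * keeps an exactly aligned address (the lowest byte of a stack)
         * from rounding to its own base.
         */
        stack_top = ALIGN(fault_address + 1, THREAD_SIZE);

        /*
         * The owning task's task_struct pointer sits in the last word
         * of the topmost, always-present stack page, carved out of
         * TOP_OF_KERNEL_STACK_PADDING.
         */
        tsk = *(struct task_struct **)(stack_top - sizeof(void *));

        return dynamic_stack_map_page(tsk, fault_address);
}

This never reads "current", so it should be safe in the
__switch_to_asm() window you describe: the task pointer comes from the
stack that actually faulted, regardless of what pcpu_hot.current_task
says at that point.

Pasha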