On Wed, Mar 13, 2024 at 9:43 AM Pasha Tatashin <pasha.tatashin@xxxxxxxxxx> wrote:
>
> On Wed, Mar 13, 2024 at 6:23 AM Thomas Gleixner <tglx@xxxxxxxxxxxxx> wrote:
> >
> > On Mon, Mar 11 2024 at 16:46, Pasha Tatashin wrote:
> > > @@ -413,6 +413,9 @@ DEFINE_IDTENTRY_DF(exc_double_fault)
> > >         }
> > >  #endif
> > >
> > > +       if (dynamic_stack_fault(current, address))
> > > +               return;
> > > +
> > >         irqentry_nmi_enter(regs);
> > >         instrumentation_begin();
> > >         notify_die(DIE_TRAP, str, regs, error_code, X86_TRAP_DF, SIGSEGV);
> > > diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
> > > index d6375b3c633b..651c558b10eb 100644
> > > --- a/arch/x86/mm/fault.c
> > > +++ b/arch/x86/mm/fault.c
> > > @@ -1198,6 +1198,9 @@ do_kern_addr_fault(struct pt_regs *regs, unsigned long hw_error_code,
> > >         if (is_f00f_bug(regs, hw_error_code, address))
> > >                 return;
> > >
> > > +       if (dynamic_stack_fault(current, address))
> > > +               return;
> >
> > T1 schedules out with stack used close to the fault boundary.
> >
> >     switch_to(T2)
> >
> > Now T1 schedules back in
> >
> >     switch_to(T1)
> >       __switch_to_asm()
> >         ...
> >         switch_stacks()      <- SP on T1 stack
> >  !      ...
> >  !      jmp __switch_to()
> >  !  __switch_to()
> >  !    ...
> >  !    raw_cpu_write(pcpu_hot.current_task, next_p);
> >
> > After switching SP to T1's stack and up to the point where
> > pcpu_hot.current_task (aka current) is updated to T1 a stack fault will
> > invoke dynamic_stack_fault(T2, address) which will return false here:
> >
> >         /* check if address is inside the kernel stack area */
> >         stack = (unsigned long)tsk->stack;
> >         if (address < stack || address >= stack + THREAD_SIZE)
> >                 return false;
> >
> > because T2's stack does obviously not cover the faulting address on T1's
> > stack. As a consequence double fault will panic the machine.
>
> Hi Thomas,
>
> Thank you, you are absolutely right, we can't trust "current" in the
> fault handler.
>
> We can change dynamic_stack_fault() to only accept fault_address as an
> argument, and let it determine the right task_struct pointer
> internally.

Let's modify dynamic_stack_fault() to accept only the fault_address.
It can then determine the correct task_struct pointer internally.

Here's a potential solution that is fast, avoids locking, and ensures
atomicity:

1. Kernel Stack VA Space
Dedicate a virtual address range ([KSTACK_START_VA - KSTACK_END_VA])
exclusively for kernel stacks. This makes it cheap to validate that a
faulting address belongs to a stack.

2. Finding the faulting task
- Use ALIGN(fault_address, THREAD_SIZE) to calculate the end of the
topmost stack page (since stack addresses are aligned to THREAD_SIZE).
- Store the task_struct pointer as the last word on this topmost page,
which is always present because it is a pre-allocated stack page.

3. Stack Padding
Increase padding to 8 bytes on x86_64 (TOP_OF_KERNEL_STACK_PADDING 8)
to accommodate the task_struct pointer.

Alternatively, do not even look up the task_struct in
dynamic_stack_fault(): only install the mapping for the faulting
address, store the VA in a per-CPU array, and handle the rest in
dynamic_stack() during context switching. At that time spin locks can
be taken, and we can do a find_vm_area(addr) call. This way, we would
not need to modify TOP_OF_KERNEL_STACK_PADDING to keep the task_struct
pointer in there.

Pasha