On March 14, 2024 12:43:06 PM PDT, Matthew Wilcox <willy@xxxxxxxxxxxxx> wrote:
>On Tue, Mar 12, 2024 at 10:18:10AM -0700, H. Peter Anvin wrote:
>> Second, non-dynamic kernel memory is one of the core design decisions in
>> Linux from early on. This means there are lot of deeply embedded assumptions
>> which would have to be untangled.
>
>I think there are other ways of getting the benefit that Pasha is seeking
>without moving to dynamically allocated kernel memory. One icky thing
>that XFS does is punt work over to a kernel thread in order to use more
>stack! That breaks a number of things including lockdep (because the
>kernel thread doesn't own the lock, the thread waiting for the kernel
>thread owns the lock).
>
>If we had segmented stacks, XFS could say "I need at least 6kB of stack",
>and if less than that was available, we could allocate a temporary
>stack and switch to it. I suspect Google would also be able to use this
>API for their rare cases when they need more than 8kB of kernel stack.
>Who knows, we might all be able to use such a thing.
>
>I'd been thinking about this from the point of view of allocating more
>stack elsewhere in kernel space, but combining what Pasha has done here
>with this idea might lead to a hybrid approach that works better; allocate
>32kB of vmap space per kernel thread, put 12kB of memory at the top of it,
>rely on people using this "I need more stack" API correctly, and free the
>excess pages on return to userspace. No complicated "switch stacks" API
>needed, just an "ensure we have at least N bytes of stack remaining" API.

This is basically what stack probes do. They provide a very cheap "API"
that is virtually free in the common case. In the slow case it goes via
the #PF (not #DF!) path, but synchronously and at a well-defined point.
As a side benefit, they can be compiler-generated, since some operating
systems require them.
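
To make the "ensure we have at least N bytes of stack remaining" idea
concrete, here is a rough userspace model of it. This is only a sketch
under stated assumptions: it tracks sizes rather than touching real
memory, every name in it (sim_stack, sim_stack_ensure, and so on) is
made up for illustration, and nothing here is an existing kernel
interface. The 32kB reservation and 12kB initial population are the
numbers from the proposal quoted above; the 4kB page size is assumed.

```c
/* Userspace model only: all names are hypothetical, not kernel APIs. */
#include <assert.h>
#include <stddef.h>

#define PAGE_SZ     4096u            /* assumed page size */
#define RESERVE_SZ  (32u * 1024u)    /* vmap space reserved per thread */
#define INITIAL_SZ  (12u * 1024u)    /* memory populated at the top */

struct sim_stack {
    size_t sp;         /* bytes in use, measured down from the top */
    size_t populated;  /* bytes backed by memory, measured from the top */
    unsigned faults;   /* number of times the slow path ran */
};

static void sim_stack_init(struct sim_stack *s)
{
    s->sp = 0;
    s->populated = INITIAL_SZ;
    s->faults = 0;
}

/* Slow path: stands in for a #PF handler backing one more page. */
static int sim_fault_in_page(struct sim_stack *s)
{
    if (s->populated + PAGE_SZ > RESERVE_SZ)
        return -1;               /* reservation exhausted */
    s->populated += PAGE_SZ;
    s->faults++;
    return 0;
}

/*
 * "Ensure we have at least n bytes of stack remaining."
 * Common case: one comparison, no fault. Slow case: pages are
 * faulted in synchronously, at this well-defined call site.
 */
static int sim_stack_ensure(struct sim_stack *s, size_t n)
{
    assert(s->sp <= s->populated);
    while (s->populated - s->sp < n) {
        if (sim_fault_in_page(s) != 0)
            return -1;
    }
    return 0;
}

/* On return to userspace: drop everything beyond the initial 12kB. */
static void sim_stack_trim(struct sim_stack *s)
{
    assert(s->sp <= INITIAL_SZ);
    s->populated = INITIAL_SZ;
}
```

With 8kB already in use, asking for 2kB takes the fast path, while
asking for 6kB faults in exactly one extra page. A request that cannot
fit in the reservation fails at the probe rather than at some arbitrary
later store, which is the property that makes the slow case tractable.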