Re: [RFC 00/14] Dynamic Kernel Stacks

"H. Peter Anvin" <hpa@xxxxxxxxx> · Sat, 16 Mar 2024 17:47:18 -0700

On March 14, 2024 12:43:06 PM PDT, Matthew Wilcox <willy@xxxxxxxxxxxxx> wrote:
>On Tue, Mar 12, 2024 at 10:18:10AM -0700, H. Peter Anvin wrote:
>> Second, non-dynamic kernel memory is one of the core design decisions in
>> Linux from early on. This means there are a lot of deeply embedded assumptions
>> which would have to be untangled.
>
>I think there are other ways of getting the benefit that Pasha is seeking
>without moving to dynamically allocated kernel memory.  One icky thing
>that XFS does is punt work over to a kernel thread in order to use more
>stack!  That breaks a number of things including lockdep (because the
>kernel thread doesn't own the lock, the thread waiting for the kernel
>thread owns the lock).
>
>If we had segmented stacks, XFS could say "I need at least 6kB of stack",
>and if less than that was available, we could allocate a temporary
>stack and switch to it.  I suspect Google would also be able to use this
>API for their rare cases when they need more than 8kB of kernel stack.
>Who knows, we might all be able to use such a thing.
>
>I'd been thinking about this from the point of view of allocating more
>stack elsewhere in kernel space, but combining what Pasha has done here
>with this idea might lead to a hybrid approach that works better; allocate
>32kB of vmap space per kernel thread, put 12kB of memory at the top of it,
>rely on people using this "I need more stack" API correctly, and free the
>excess pages on return to userspace.  No complicated "switch stacks" API
>needed, just an "ensure we have at least N bytes of stack remaining" API.

This is basically what stack probes do. They provide a very cheap "API" that is virtually free in the common case; in the slow case it goes via the #PF (not #DF!) path, but synchronously and at a well-defined point. As a side benefit, probes can be compiler-generated, since some operating systems require them.
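
To make the idea concrete, here is a minimal user-space sketch of a probe loop in C. It is not kernel code and not what any particular compiler emits; the names stack_probe() and PROBE_PAGE_SIZE are made up for illustration, and the 4 KiB page size is an assumption. The function touches one byte in each page of the region just below a given "top of stack" address, so that any page that is not yet backed would fault right there, at the probe, rather than at some arbitrary later push.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumed page size for this sketch; real code would use the platform's value. */
#define PROBE_PAGE_SIZE 4096u

/*
 * Hypothetical helper: read one byte from every page in the `bytes`
 * immediately below `top`, walking downward the way a stack grows.
 * Each read is the cheap common-case check; if a page were not mapped,
 * the read is where the page fault would be taken.
 * Returns the number of probe reads performed.
 */
static size_t stack_probe(const volatile uint8_t *top, size_t bytes)
{
    size_t touched = 0;
    size_t off;

    /* One read per full page, nearest page first. */
    for (off = PROBE_PAGE_SIZE; off <= bytes; off += PROBE_PAGE_SIZE) {
        (void)top[-(ptrdiff_t)off];
        touched++;
    }

    /* Cover the partial page at the far end of the request, if any. */
    if (bytes % PROBE_PAGE_SIZE != 0) {
        (void)top[-(ptrdiff_t)bytes];
        touched++;
    }

    return touched;
}
```

With this sketch, a caller that wants 6kB of headroom would do two reads (one at 4kB, one at 6kB below the top). Walking from the nearest page downward matters: it means pages are faulted in order and no probe ever skips over an unmapped page.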