On Thu, Mar 14, 2024 at 03:53:39PM -0400, Kent Overstreet wrote:
> On Thu, Mar 14, 2024 at 07:43:06PM +0000, Matthew Wilcox wrote:
> > On Tue, Mar 12, 2024 at 10:18:10AM -0700, H. Peter Anvin wrote:
> > > Second, non-dynamic kernel memory is one of the core design decisions in
> > > Linux from early on. This means there are lot of deeply embedded assumptions
> > > which would have to be untangled.
> >
> > I think there are other ways of getting the benefit that Pasha is seeking
> > without moving to dynamically allocated kernel memory. One icky thing
> > that XFS does is punt work over to a kernel thread in order to use more
> > stack! That breaks a number of things including lockdep (because the
> > kernel thread doesn't own the lock, the thread waiting for the kernel
> > thread owns the lock).
> >
> > If we had segmented stacks, XFS could say "I need at least 6kB of stack",
> > and if less than that was available, we could allocate a temporary
> > stack and switch to it. I suspect Google would also be able to use this
> > API for their rare cases when they need more than 8kB of kernel stack.
> > Who knows, we might all be able to use such a thing.
> >
> > I'd been thinking about this from the point of view of allocating more
> > stack elsewhere in kernel space, but combining what Pasha has done here
> > with this idea might lead to a hybrid approach that works better; allocate
> > 32kB of vmap space per kernel thread, put 12kB of memory at the top of it,
> > rely on people using this "I need more stack" API correctly, and free the
> > excess pages on return to userspace. No complicated "switch stacks" API
> > needed, just an "ensure we have at least N bytes of stack remaining" API.
>
> Why would we need an "I need more stack" API? Pasha's approach seems
> like everything we need for what you're talking about.

Because double faults are hard, possibly impossible, and the FRED approach
Peter described has extra overhead? This was all described up-thread.
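
For illustration only, the hybrid scheme quoted above (32kB of reserved
space, 12kB populated at the top, an "ensure we have at least N bytes of
stack remaining" call, excess freed on return to userspace) could be
modelled in plain userspace C roughly as follows. Nothing here is kernel
code: struct stack_model, stack_init(), stack_ensure() and
stack_release_excess() are hypothetical names made up for this sketch, the
4kB page size is an assumption, and a real implementation would have to
allocate and map pages where the model only adjusts a counter.

```c
#include <assert.h>
#include <stddef.h>

#define PAGE_SZ     4096u           /* assumed page size */
#define RESERVED_SZ (32u * 1024u)   /* address space reserved per thread */
#define INITIAL_SZ  (12u * 1024u)   /* memory populated at the top */

/* Hypothetical model of one thread's stack; not a kernel structure. */
struct stack_model {
	size_t populated;  /* bytes backed by memory, counted from the top */
	size_t used;       /* bytes currently in use, counted from the top */
};

static void stack_init(struct stack_model *s)
{
	assert(s != NULL);
	s->populated = INITIAL_SZ;
	s->used = 0;
}

/*
 * Ensure at least `need` bytes are available below the current stack
 * position.  Returns 0 on success, -1 if the request does not fit in
 * the reserved range.
 */
static int stack_ensure(struct stack_model *s, size_t need)
{
	size_t want;

	assert(s != NULL);
	want = s->used + need;
	if (want < s->used || want > RESERVED_SZ)
		return -1;

	/* Round up to whole pages; RESERVED_SZ is itself page-aligned. */
	want = (want + PAGE_SZ - 1) / PAGE_SZ * PAGE_SZ;
	if (want > s->populated)
		s->populated = want;  /* real code would map pages here */
	return 0;
}

/* Model of "return to userspace": stack is empty, drop the excess. */
static void stack_release_excess(struct stack_model *s)
{
	assert(s != NULL);
	s->used = 0;
	s->populated = INITIAL_SZ;
}
```

In this model a caller with 8kB already in use that asks for 6kB more
grows the populated range from 12kB to 16kB, and the extra page goes away
again at the release step. The point of the explicit call is that growth
happens at a known, safe place rather than in a fault handler.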