On March 14, 2024 8:13:56 PM PDT, Pasha Tatashin <pasha.tatashin@xxxxxxxxxx> wrote:
>On Thu, Mar 14, 2024 at 3:57 PM Matthew Wilcox <willy@xxxxxxxxxxxxx> wrote:
>>
>> On Thu, Mar 14, 2024 at 03:53:39PM -0400, Kent Overstreet wrote:
>> > On Thu, Mar 14, 2024 at 07:43:06PM +0000, Matthew Wilcox wrote:
>> > > On Tue, Mar 12, 2024 at 10:18:10AM -0700, H. Peter Anvin wrote:
>> > > > Second, non-dynamic kernel memory is one of the core design decisions in
>> > > > Linux from early on. This means there are a lot of deeply embedded assumptions
>> > > > which would have to be untangled.
>> > >
>> > > I think there are other ways of getting the benefit that Pasha is seeking
>> > > without moving to dynamically allocated kernel memory. One icky thing
>> > > that XFS does is punt work over to a kernel thread in order to use more
>> > > stack! That breaks a number of things, including lockdep (because the
>> > > kernel thread doesn't own the lock; the thread waiting for the kernel
>> > > thread owns the lock).
>> > >
>> > > If we had segmented stacks, XFS could say "I need at least 6kB of stack",
>> > > and if less than that was available, we could allocate a temporary
>> > > stack and switch to it. I suspect Google would also be able to use this
>> > > API for their rare cases when they need more than 8kB of kernel stack.
>> > > Who knows, we might all be able to use such a thing.
>> > >
>> > > I'd been thinking about this from the point of view of allocating more
>> > > stack elsewhere in kernel space, but combining what Pasha has done here
>> > > with this idea might lead to a hybrid approach that works better: allocate
>> > > 32kB of vmap space per kernel thread, put 12kB of memory at the top of it,
>> > > rely on people using this "I need more stack" API correctly, and free the
>> > > excess pages on return to userspace. No complicated "switch stacks" API
>> > > needed, just an "ensure we have at least N bytes of stack remaining" API.
>
>I like this approach! I think we could also consider having permanent
>big stacks for some kernel-only threads like kvm-vcpu. A cooperative
>stack-increase framework could work well and wouldn't negatively
>impact the performance of context switching. However, thorough
>analysis would be necessary to proactively identify potential stack
>overflow situations.
>
>> > Why would we need an "I need more stack" API? Pasha's approach seems
>> > like everything we need for what you're talking about.
>>
>> Because double faults are hard, possibly impossible, and the FRED approach
>> Peter described has extra overhead? This was all described up-thread.
>
>Handling faults in #DF is possible. It requires code inspection to
>handle race conditions such as the one shown by tglx. However, as
>Andy pointed out, this is not supported by the SDM, as #DF is an abort
>context (yet we return from it because of ESPFIX64, so return is
>possible).
>
>My question, however: if we ignore the memory savings and only
>consider the reliability aspect of this feature, what is better:
>unconditionally crashing the machine because a guard page was reached,
>or printing a huge warning with backtrace information about the
>offending stack, handling the fault, and surviving? I know that
>historically Linus preferred WARN() to BUG() [1]. But this is a
>somewhat different scenario compared to a simple BUG vs WARN.
>
>Pasha
>
>[1] https://lore.kernel.org/all/Pine.LNX.4.44.0209091832160.1714-100000@xxxxxxxxxxxxxxxxxx
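To make the "ensure we have at least N bytes of stack remaining" idea
concrete, here is a minimal sketch of what such an API might look like.
This is purely illustrative: ensure_stack(), stack_populate_pages() and
->stack_mapped_low are hypothetical names, not existing kernel
interfaces.

/*
 * Hypothetical sketch only -- ensure_stack(), stack_populate_pages()
 * and ->stack_mapped_low do not exist in today's kernel.
 *
 * The idea from the thread: each kernel thread owns 32kB of vmap space
 * with only 12kB of pages mapped at the top.  Code that knows it is
 * about to go deep asks for more before recursing; the excess pages
 * are freed on return to userspace.
 */
static int ensure_stack(size_t bytes)
{
	unsigned long sp = current_stack_pointer;

	/* Fast path: enough populated stack below the stack pointer. */
	if (likely(sp - current->stack_mapped_low >= bytes))
		return 0;

	/* Populate more pages of this thread's vmap stack area. */
	return stack_populate_pages(current, sp - bytes);
}

An XFS-style caller would then do ensure_stack(6 * 1024) before
entering its deep path, instead of punting the work to a kworker.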
The real issue with using #DF is that if the event that caused it was
asynchronous, you could lose the event.
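For what it's worth, the "print a warning, handle the fault, and
survive" model is easy to demo from userspace -- as a rough analogy
only: SIGSEGV stands in for the stack fault, sigaltstack for the IST
stack, and a 4kB page size is assumed.

#define _GNU_SOURCE
#include <signal.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

#define PAGE_SZ 4096UL			/* assume 4kB pages for the demo */

static void segv_handler(int sig, siginfo_t *si, void *uc)
{
	void *page = (void *)((uintptr_t)si->si_addr & ~(PAGE_SZ - 1));
	static const char msg[] = "WARNING: fault handled, mapping a page\n";

	write(STDERR_FILENO, msg, sizeof(msg) - 1);

	/*
	 * Map a page at the faulting address so the access can be
	 * retried.  (mmap() is not formally async-signal-safe, but it
	 * is fine for a demo on Linux.)
	 */
	if (mmap(page, PAGE_SZ, PROT_READ | PROT_WRITE,
		 MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED, -1, 0) == MAP_FAILED)
		_exit(1);		/* cannot recover: the BUG() case */
}

int main(void)
{
	static stack_t ss;
	struct sigaction sa;

	/* The handler runs on its own stack -- the analog of an IST stack. */
	ss.ss_sp = malloc(SIGSTKSZ);
	ss.ss_size = SIGSTKSZ;
	sigaltstack(&ss, NULL);

	memset(&sa, 0, sizeof(sa));
	sa.sa_sigaction = segv_handler;
	sa.sa_flags = SA_SIGINFO | SA_ONSTACK;
	sigaction(SIGSEGV, &sa, NULL);

	/* Touch an unmapped address; the handler maps it and we survive. */
	volatile char *p = (char *)0x700000000000UL;
	*p = 42;
	printf("survived: %d\n", *p);
	return 0;
}

Whether the kernel could do the equivalent from #DF is exactly the open
question above; the possible loss of an asynchronous event is one
reason it may not be safe.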