On Mon, Mar 18, 2024 at 11:09 AM Pasha Tatashin <pasha.tatashin@xxxxxxxxxx> wrote:
>
> On Sun, Mar 17, 2024 at 2:58 PM David Laight <David.Laight@xxxxxxxxxx> wrote:
> >
> > From: Pasha Tatashin
> > > Sent: 16 March 2024 19:18
> > ...
> > > Expanding on Matthew's idea of an interface for dynamic kernel stack
> > > sizes, here's what I'm thinking:
> > >
> > > - Kernel threads: Create all kernel threads with a fully populated
> > >   THREAD_SIZE stack (i.e. 16K).
> > > - User threads: Create all user threads with a THREAD_SIZE kernel
> > >   stack, but with only the top page mapped (i.e. 4K).
> > > - In enter_from_user_mode(): Expand the thread stack to 16K by mapping
> > >   three additional pages from the per-CPU stack cache. This function is
> > >   called early in kernel entry points.
> > > - In exit_to_user_mode(): Unmap the extra three pages and return them
> > >   to the per-CPU cache. This function is called late in the kernel exit
> > >   path.
> >
> > Isn't that entirely horrid for TLB use, and so going to require a lot
> > of IPIs?
>
> The TLB load is going to be exactly the same as today: we already use
> small pages for VMA-mapped stacks. We won't need extra flushing
> either; the mappings are in kernel space, and once the pages are
> removed from the page table, nothing will access that VA range until
> the thread enters the kernel again. We need to invalidate the VA
> range only when the pages are mapped, and only on the local CPU.

The TLB miss rate will increase, but only very slightly: a stack is
just 4 pages, of which 3 are dynamic, so a syscall adds at most 2-3
new misses, and only for the complicated, deep syscalls. I therefore
suspect it won't affect real-world performance.

> > Remember, if a thread sleeps in 'extra stack' and is then rescheduled
> > on a different cpu the extra pages get 'pumped' from one cpu to
> > another.
>
> Yes, the per-CPU cache can get unbalanced this way; we can remember
> the original CPU where we acquired the pages and return them to the
> same place.
>
> > I also suspect a stack_probe() is likely to end up being a cache miss
> > and also slow???
>
> Can you please elaborate on this point? I am not aware of
> stack_probe() and how it is used.
>
> > So you wouldn't want one on all calls.
> > I'm not sure you'd want a conditional branch either.
> >
> > The explicit request for 'more stack' can be required to be allowed
> > to sleep - removing a lot of issues.
> > It would also be portable to all architectures.
> > I'd also suspect that any thread that needs extra stack is likely
> > to need it again.
> > So while the memory could be recovered, I'd bet it isn't worth
> > doing except under memory pressure.
> > The call could also return 'no' - perhaps useful for (broken) code
> > that insists on being recursive.
>
> The approach currently under discussion is somewhat different from an
> explicit more-stack request API. I am investigating how feasible
> kernel stack multiplexing is, so that the same pages can be reused by
> many threads and are consumed only while actually in use. If the
> multiplexing approach does not work out, I will come back to the
> explicit more-stack API.
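To make the entry/exit idea above more concrete, here is a rough
sketch of the two hooks as I am currently picturing them. Everything
below is illustrative only: map_kernel_stack_pages(),
unmap_kernel_stack_pages() and local_flush_tlb_kernel_range() stand in
for the real vmap/arch primitives, stack_fully_mapped and
stack_page_home are hypothetical task_struct fields, and the cache
refill and page-return paths are omitted.

#define DYN_STACK_PAGES	3	/* 16K stack, top 4K always mapped */

struct dyn_stack_cache {
	struct page	*pages[2 * DYN_STACK_PAGES];
	unsigned int	nr;
};
static DEFINE_PER_CPU(struct dyn_stack_cache, dyn_stack_cache);

/* Early in enter_from_user_mode(), before any deep call chain. */
static void dyn_stack_expand(struct task_struct *tsk)
{
	struct dyn_stack_cache *c = this_cpu_ptr(&dyn_stack_cache);
	unsigned long start = (unsigned long)tsk->stack;
	struct page *pages[DYN_STACK_PAGES];
	int i;

	if (tsk->stack_fully_mapped)	/* hypothetical flag */
		return;

	/* Take pages from the local cache; refill path omitted. */
	for (i = 0; i < DYN_STACK_PAGES; i++)
		pages[i] = c->pages[--c->nr];

	/* Install them under the stack's VA (hypothetical helper). */
	map_kernel_stack_pages(start, DYN_STACK_PAGES, pages);

	/*
	 * These are brand-new translations, and nothing touched this
	 * VA range while it was unmapped, so a local-only invalidation
	 * is enough: no IPIs, no remote shootdown.
	 */
	local_flush_tlb_kernel_range(start,
				     start + DYN_STACK_PAGES * PAGE_SIZE);

	/* Remember where the pages came from so we can return them. */
	tsk->stack_page_home = smp_processor_id();
	tsk->stack_fully_mapped = true;
}

/* Late in exit_to_user_mode(), after the last deep call. */
static void dyn_stack_shrink(struct task_struct *tsk)
{
	struct page *pages[DYN_STACK_PAGES];

	unmap_kernel_stack_pages((unsigned long)tsk->stack,
				 DYN_STACK_PAGES, pages);

	/*
	 * No TLB flush here: this thread cannot touch the range again
	 * before re-entering the kernel, and dyn_stack_expand() will
	 * flush before the pages are used again.
	 *
	 * Return the pages to the CPU they came from (stack_page_home),
	 * so the per-CPU caches stay balanced when a thread sleeps on
	 * 'extra stack' and wakes up elsewhere; omitted here.
	 */
	tsk->stack_fully_mapped = false;
}

The property I care about is that the only TLB work is a local-range
invalidate on the map side; the unmap side needs nothing because the
VA range is dead until the next kernel entry.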