On Tue, Sep 22, 2015 at 11:44 AM, Linus Torvalds <torvalds@xxxxxxxxxxxxxxxxxxxx> wrote:
> On Tue, Sep 22, 2015 at 11:37 AM, Andy Lutomirski <luto@xxxxxxxxxxxxxx> wrote:
>> kinds of mess.
>>
>> I don't think that anyone really wants to move #PF to IST, which means
>> that we simply cannot handle vmalloc faults that happen when switching
>> stacks after SYSCALL, no matter what fanciness we shove into the
>> page_fault asm.
>
> But that's fine. The kernel stack is special. So yes, we want to make
> sure that the kernel stack is always mapped in the thread whose stack
> it is.
>
> But that's not a big and onerous guarantee to make. Not when the
> *real* problem is "random vmalloc allocations made by other processes
> that we are not in the least interested in, and we don't want to add
> synchronization for".
>

It's the kernel stack, the TSS (for sp0) and rsp_scratch at least.
But yes, that's not that onerous, and it's never lazily initialized
elsewhere.

How about this (long-term, not right now): Never free pgd entries.
For each pgd, track the number of populated kernel entries. Also
track the global (init_mm) number of existing kernel entries. At
context switch time, if new_pgd has fewer entries than the total, sync
it.

This hits *at most* 256 times per thread, and otherwise it's just a
single unlikely branch. It guarantees that we only ever take a
vmalloc fault when accessing maps that didn't exist when we last
context switched, which gets us all of the important percpu stuff and
the kernel stack, even if we schedule onto a cpu that didn't exist
when the mm was created.

--Andy

> Linus

-- 
Andy Lutomirski
AMA Capital Management, LLC
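
For illustration only, the counting scheme proposed in the message above can be modelled in plain userspace C. This is a rough sketch and not kernel code: `struct pgd_model`, `init_pgd`, `populate_kernel_entry()`, `sync_kernel_entries()` and `context_switch_check()` are hypothetical names invented here, and the figure of 256 kernel entries per pgd is an assumption taken from the "at most 256 times" remark. Locking, TLB handling and the real page-table layout are all left out.

```c
#include <stdbool.h>

/* Assumption: the kernel half of a pgd holds 256 top-level entries. */
#define KERNEL_PGD_ENTRIES 256

/* Hypothetical model of one pgd: only the kernel entries plus a counter. */
struct pgd_model {
    unsigned long entry[KERNEL_PGD_ENTRIES]; /* 0 means "not populated" */
    unsigned int nr_kernel_entries;          /* populated kernel entries */
};

/* Stands in for the reference (init_mm) pgd and its global count. */
static struct pgd_model init_pgd;

/*
 * Populate a kernel entry in the reference pgd. Entries are never
 * freed or overwritten, so the global count only ever grows.
 */
static void populate_kernel_entry(unsigned int idx, unsigned long val)
{
    if (init_pgd.entry[idx] == 0) {
        init_pgd.entry[idx] = val;
        init_pgd.nr_kernel_entries++;
    }
}

/* Copy every reference entry that this pgd is missing; return how many. */
static unsigned int sync_kernel_entries(struct pgd_model *pgd)
{
    unsigned int copied = 0;

    for (unsigned int i = 0; i < KERNEL_PGD_ENTRIES; i++) {
        if (pgd->entry[i] == 0 && init_pgd.entry[i] != 0) {
            pgd->entry[i] = init_pgd.entry[i];
            pgd->nr_kernel_entries++;
            copied++;
        }
    }
    return copied;
}

/*
 * The check made at context switch time: a single comparison in the
 * common case, and a sync only when the incoming pgd has fallen behind.
 * Returns true if a sync was performed.
 */
static bool context_switch_check(struct pgd_model *new_pgd)
{
    if (new_pgd->nr_kernel_entries < init_pgd.nr_kernel_entries) {
        sync_kernel_entries(new_pgd);
        return true;
    }
    return false;
}
```

In this model the count comparison is sufficient because entries are never freed: each pgd's populated kernel entries are always a subset of the reference set, so equal counts imply identical sets. Each sync copies at least one entry, so a given pgd can take the slow path at most `KERNEL_PGD_ENTRIES` times.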