On Tue, Sep 22, 2015 at 10:55 AM, Linus Torvalds
<torvalds@xxxxxxxxxxxxxxxxxxxx> wrote:
> On Mon, Sep 21, 2015 at 11:23 PM, Ingo Molnar <mingo@xxxxxxxxxx> wrote:
>> Add a late PGD init callback to places that allocate a new MM
>> with a new PGD: copy_process() and exec().
>>
>> The purpose of this callback is to allow architectures to implement
>> lockless initialization of task PGDs, to remove the scalability
>> limit of pgd_list/pgd_lock.
>
> Do we really need this?
>
> Can't we just initialize the pgd when we allocate it, knowing that
> it's not in sync, but just depend on the vmalloc fault to add in any
> kernel entries that we might have missed?

I really really hate the vmalloc fault thing.  It seems to work,
rather to my surprise.  It doesn't *deserve* to work, because of
things like the percpu TSS accesses in the entry code that happen
without a valid stack.

For all I know, there's a long history of this hitting on monster
non-SMAP systems that are all buggy and rootable but no one notices
because it's rare.  On SMAP with non-malicious userspace, it's an
instant double fault.  With malicious userspace, it's rootable
regardless of SMAP, but it's much harder with SMAP.

If we start every mm with a fully zeroed pgd (which is what I think
you're suggesting), then this starts affecting small systems in
addition to monster systems.

I'd really rather go in the other direction and completely eliminate
vmalloc faults.  We could do that by eagerly initializing all pgd, or
we could do it by tracking, per-pgd, how up-to-date it is and fixing
it up in switch_mm.  The latter is a bit nasty on SMP.

--Andy
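
For illustration only, the second option above (tracking, per-pgd, how
up-to-date it is and fixing it up in switch_mm) could be sketched in
plain userspace C with a generation counter.  This is not kernel code
and not a proposed patch: the struct names, KERNEL_PGD_SLOTS, and the
helpers kernel_pgd_set() and pgd_sync_on_switch() are all hypothetical,
and the sketch is single-threaded, so it deliberately leaves out the
SMP ordering that makes the real thing nasty.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical: number of top-level entries covering kernel mappings. */
#define KERNEL_PGD_SLOTS 8

/* Reference copy of the kernel entries, plus a change counter. */
struct kernel_pgd_ref {
	uint64_t entries[KERNEL_PGD_SLOTS];
	uint64_t generation;	/* bumped on every kernel entry change */
};

/* Per-mm pgd: remembers which generation it was last synced to. */
struct task_pgd {
	uint64_t entries[KERNEL_PGD_SLOTS];
	uint64_t synced_generation;
};

/* Install a kernel entry in the reference table and mark all pgds stale. */
static void kernel_pgd_set(struct kernel_pgd_ref *ref, unsigned int slot,
			   uint64_t val)
{
	assert(slot < KERNEL_PGD_SLOTS);
	ref->entries[slot] = val;
	ref->generation++;
}

/*
 * Stand-in for the fixup that would run from switch_mm: if this pgd is
 * behind the reference, copy the kernel entries before it is used.
 * Returns 1 if a copy was needed, 0 if the pgd was already current.
 */
static int pgd_sync_on_switch(struct task_pgd *pgd,
			      const struct kernel_pgd_ref *ref)
{
	if (pgd->synced_generation == ref->generation)
		return 0;

	memcpy(pgd->entries, ref->entries, sizeof(pgd->entries));
	pgd->synced_generation = ref->generation;
	return 1;
}
```

The point of the sketch is only that a stale pgd is caught at switch
time instead of at fault time; a pgd that is already current costs one
comparison.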