This is part of a larger series here, but the beginning bit is irrelevant to
the current discussion:

https://git.kernel.org/pub/scm/linux/kernel/git/luto/linux.git/commit/?h=x86/mm&id=203d39d11562575fd8bd6a094d97a3a332d8b265

This is IMO a lot better than v1.  It's now almost entirely in generic code.
(It looks like it's 100% generic, but that's a lie -- the generic code
currently assumes that all possible lazy mm refs are in mm_cpumask(), and
that's not true on all arches.  So, if we take my approach, we'll need a
little arch hook to control this; a rough sketch of what I mean is appended
at the end of this mail.)

Here's how I think it fits with various arches:

x86: On bare metal (i.e. paravirt flush unavailable), the loop won't do
much.  The existing TLB shootdown when user tables are freed will empty
mm_cpumask of everything but the calling CPU.  So x86 ends up pretty close
to as good as we can get short of reworking mm_cpumask() itself.

arm64: It needs the fixup above for correctness, but I think performance
should be pretty good.  Compared to current kernels, we lose an mmgrab()
and mmdrop() on each lazy transition, and we add a reasonably fast loop
over all CPUs on process exit.  Someone (probably me) needs to make sure
we don't need some extra barriers.

power: Similar to x86.

s390x: Should be essentially the same as arm64.

Other arches: I don't know.  Further research is required.

What do you all think?

Andy Lutomirski (2):
  [NEEDS HELP] x86/mm: Handle unlazying membarrier core sync in the arch
    code
  [MOCKUP] sched/mm: Lightweight lazy mm refcounting

 arch/x86/mm/tlb.c    |  17 +++++-
 kernel/fork.c        |   4 ++
 kernel/sched/core.c  | 134 +++++++++++++++++++++++++++++++++++++------
 kernel/sched/sched.h |  11 +++-
 4 files changed, 145 insertions(+), 21 deletions(-)

-- 
2.28.0
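
P.S. To make the "little arch hook" plus exit-time loop a bit more concrete,
here is a very rough sketch of the shape, not the code in patch 2.  The names
arch_lazy_mm_refs_in_cpumask(), cpu_lazy_mm, and shoot_lazy_mm_refs() are
placeholders made up for illustration, and all of the interesting
ordering/barrier questions are deliberately ignored:

#include <linux/atomic.h>
#include <linux/cpumask.h>
#include <linux/mm_types.h>
#include <linux/percpu.h>
#include <linux/smp.h>

/* Placeholder: which mm, if any, this CPU is currently borrowing lazily. */
static DEFINE_PER_CPU(struct mm_struct *, cpu_lazy_mm);

/*
 * Placeholder arch hook: true iff every CPU that holds a lazy reference
 * to an mm is guaranteed to be set in mm_cpumask(mm).  That holds on
 * x86; an arch where it doesn't would return false and keep the existing
 * refcounted lazy mm behavior.
 */
static inline bool arch_lazy_mm_refs_in_cpumask(void)
{
        return IS_ENABLED(CONFIG_X86);
}

/*
 * Run once on the exit path, after userspace is gone: walk mm_cpumask()
 * and knock out any remaining remote lazy references so the final
 * mmdrop() can't race with a lazy user.  In this sketch "knock out" just
 * means clearing the remote CPU's lazy pointer if it still points at mm;
 * the real hand-off needs more care than a bare cmpxchg().
 */
static void shoot_lazy_mm_refs(struct mm_struct *mm)
{
        int cpu;

        if (!arch_lazy_mm_refs_in_cpumask())
                return;         /* fall back to mmgrab()/mmdrop() */

        for_each_cpu(cpu, mm_cpumask(mm)) {
                if (cpu == raw_smp_processor_id())
                        continue;
                cmpxchg(&per_cpu(cpu_lazy_mm, cpu), mm, NULL);
        }
}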