Re: [patch 080/262] lazy tlb: allow lazy tlb mm refcounting to be configurable

"Andy Lutomirski" <luto@xxxxxxxxxx> · Fri, 05 Nov 2021 21:29:19 -0700

On Fri, Nov 5, 2021, at 1:38 PM, Andrew Morton wrote:
> From: Nicholas Piggin <npiggin@xxxxxxxxx>
> Subject: lazy tlb: allow lazy tlb mm refcounting to be configurable
>
> Add CONFIG_MMU_TLB_REFCOUNT which enables refcounting of the lazy tlb mm
> when it is context switched.  This can be disabled by architectures that
> don't require this refcounting if they clean up lazy tlb mms when the last
> refcount is dropped.  Currently this is always enabled, which is what
> existing code does, so the patch is effectively a no-op.
>
> Rename rq->prev_mm to rq->prev_lazy_mm, because that's what it is.

Still nacked by me.  Since I seem to have been doing a poor job of explaining my issues with this patch, I'll explain with code:

commit 54b675d9b28d9a56289d06a813250472bc621f40
Author: Andy Lutomirski <luto@xxxxxxxxxx>
Date:   Fri Nov 5 21:20:47 2021 -0700

    [HACK] demonstrate lazy tlb issues

diff --git a/arch/Kconfig b/arch/Kconfig
index cca27f1b5d0e..19f273642d8f 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -442,6 +442,7 @@ config ARCH_WANT_IRQS_OFF_ACTIVATE_MM
 config MMU_LAZY_TLB_REFCOUNT
 	def_bool y
 	depends on !MMU_LAZY_TLB_SHOOTDOWN
+	depends on !X86
 
 # This option allows MMU_LAZY_TLB_REFCOUNT=n. It ensures no CPUs are using an
 # mm as a lazy tlb beyond its last reference count, by shooting down these
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 25dd795497e8..c5a0c1e92524 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4902,6 +4902,13 @@ context_switch(struct rq *rq, struct task_struct *prev,
 	 */
 	arch_start_context_switch(prev);
 
+	/*
+	 * Sanity check: if something went wrong and the previous mm was
+	 * freed while we were still using it, KASAN might not notice
+	 * without help.
+	 */
+	kasan_check_byte(prev->active_mm);
+
 	/*
 	 * kernel -> kernel   lazy + transfer active
 	 *   user -> kernel   lazy + mmgrab_lazy_tlb() active

Build this with KASAN for x86 and try to boot it.  It splats left and right.  The issue is that the !MMU_LAZY_TLB_REFCOUNT mode, while safe under certain select circumstances (maybe -- still not quite convinced) cheats and ignores the fact that the scheduler itself maintains a pointer to the old mm.  On x86, on bare metal, we *already* don't access lazy mms after the process is gone because the pagetable freeing process shoots down the lazy mm, so we are compliant with all the supposed preconditions of this new mode.  But the scheduler itself still has this nonsense active_mm pointer, and, if anyone ever tries to do anything with it (e.g. the above hack to force kasan to validate it), it all blows up.

On top of this, the whole refcount-me-maybe mode seems incredibly fragile, and I don't think the kernel really benefits from having a set of refcount helpers that may or may not keep the supposedly refcounted object alive depending on config.  And the mere fact that my patch appears to work as long as kasan isn't in play should be a pretty good indicator that this whole thing is not terribly robust.

So I think there are a few credible choices:

1. Find an alternative solution that gets the performance we want without dangling references.

2. Make the MMU_LAZY_TLB_REFCOUNT mode genuinely safe.  This means literally ifdeffing out active_mm so it can't dangle.  Doing that cleanly will be a lot of nasty arch work.

I again apologize that my series is taking so long, although I think it's finally getting into decent shape.  I still need to deal with the scs mess (that's new), finish tidying up kthread, and make sure hotplug is good.  But all this is because this is really hairy code and I'm trying to do it right.

If anyone wants to help, help is welcome.  Otherwise, I really do intend to get it all the way done soon.

--Andy