Re: [patch 119/212] lazy tlb: shoot lazies, a non-refcounting lazy tlb option

On Thu, Sep 2, 2021, at 2:56 PM, Andrew Morton wrote:
> From: Nicholas Piggin <npiggin@xxxxxxxxx>
> Subject: lazy tlb: shoot lazies, a non-refcounting lazy tlb option
> 
> On big systems, the mm refcount can become highly contended when doing a
> lot of context switching with threaded applications (particularly
> switching between the idle thread and an application thread).
> 
> Abandoning lazy tlb slows switching down quite a bit in the important
> user->idle->user cases, so instead implement a non-refcounted scheme that
> causes __mmdrop() to IPI all CPUs in the mm_cpumask and shoot down any
> remaining lazy ones.
> 
> Shootdown IPIs are a concern, but they have not been observed to be a
> big problem with this scheme (the powerpc implementation generated 314
> additional interrupts on a 144 CPU system during a kernel compile).  There
> are a number of strategies that could be employed to reduce IPIs if they
> turn out to be a problem for some workload.
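
For anyone who wants the contention mechanism spelled out: the refcounted
lazy tlb scheme keeps a lazy user's ->active_mm pinned via mm_count across
the context switch, roughly as in the sketch below (simplified from the
context_switch()/finish_task_switch() logic in kernel/sched/core.c; not the
literal source, and details vary by kernel version):

	/*
	 * Simplified, non-literal sketch of the refcounted lazy tlb
	 * handover that MMU_LAZY_TLB_SHOOTDOWN is designed to avoid.
	 */
	static void lazy_tlb_handover(struct task_struct *prev,
				      struct task_struct *next)
	{
		if (!next->mm) {
			/* user -> kernel/idle: borrow prev's mm lazily */
			next->active_mm = prev->active_mm;
			if (prev->mm)
				mmgrab(prev->active_mm);  /* atomic inc of mm_count */
		} else {
			/* kernel/idle -> user: switch to the real mm */
			switch_mm(prev->active_mm, next->mm, next);
			if (!prev->mm) {
				/*
				 * finish_task_switch() later does the matching
				 * mmdrop(), an atomic dec of mm_count.
				 */
				prev->active_mm = NULL;
			}
		}
	}

Every user->idle->user round trip therefore bounces the mm_count cacheline
between CPUs with an mmgrab()/mmdrop() pair, which is the contention the
patch is attacking.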

This pile is:

Nacked-by: Andy Lutomirski <luto@xxxxxxxxxx>

For reasons that have been discussed previously. My series is still in progress.  It’s moving slowly for two reasons.  First, I have limited time to work on it. Second, the existing mm refcounting is a giant pile of worms, and that needs fixing one way or another before we add yet more complexity. For example, has anyone noticed that kthread mms are refcounted using different rules than everything else?
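
(For the curious: a user task keeps its mm alive through mm_users, but a
kthread that temporarily adopts an mm goes through kthread_use_mm(), which
expects the caller to already hold the mm_users reference and pins only
mm_count for its ->active_mm. A simplified, non-literal sketch of the shape
of the kernel/kthread.c logic:

	void kthread_use_mm_sketch(struct mm_struct *mm)
	{
		struct mm_struct *active_mm = current->active_mm;

		if (active_mm != mm) {
			mmgrab(mm);		/* pin mm_count only */
			current->active_mm = mm;
		}
		current->mm = mm;
		switch_mm(active_mm, mm, current);

		if (active_mm != mm)
			mmdrop(active_mm);	/* drop the old lazy pin */
	}

So the lifetime of a kthread-held mm is governed by a different counter,
under different rules, than the same mm held by a user task.)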

Even if my modified refcounting scheme isn’t the eventual winner, the prerequisite cleanups are still prerequisites. I absolutely nack anything that adds yet more nonsensical complexity to the existing scheme, makes it substantially more fragile, and does not fix the underlying crap that makes speeding it up responsibly such a mess.

Nick or anyone else, you’re welcome to finish up my series (and I can give pointers) or you can wait.

> 
> [npiggin@xxxxxxxxx: update comments]
>   Link: https://lkml.kernel.org/r/1623121901.mszkmmum0n.astroid@xxxxxxxxx
> Link: https://lkml.kernel.org/r/20210605014216.446867-4-npiggin@xxxxxxxxx
> Signed-off-by: Nicholas Piggin <npiggin@xxxxxxxxx>
> Cc: Anton Blanchard <anton@xxxxxxxxxx>
> Cc: Andy Lutomirski <luto@xxxxxxxxxx>
> Cc: Randy Dunlap <rdunlap@xxxxxxxxxxxxx>
> Cc: Benjamin Herrenschmidt <benh@xxxxxxxxxxxxxxxxxxx>
> Cc: Paul Mackerras <paulus@xxxxxxxxxx>
> Signed-off-by: Andrew Morton <akpm@xxxxxxxxxxxxxxxxxxxx>
> ---
> 
>  arch/Kconfig  |   14 +++++++++++++
>  kernel/fork.c |   51 ++++++++++++++++++++++++++++++++++++++++++++++++
>  2 files changed, 65 insertions(+)
> 
> --- a/arch/Kconfig~lazy-tlb-shoot-lazies-a-non-refcounting-lazy-tlb-option
> +++ a/arch/Kconfig
> @@ -438,6 +438,20 @@ config ARCH_WANT_IRQS_OFF_ACTIVATE_MM
>  # to a kthread ->active_mm (non-arch code has been converted already).
>  config MMU_LAZY_TLB_REFCOUNT
>  	def_bool y
> +	depends on !MMU_LAZY_TLB_SHOOTDOWN
> +
> +# This option allows MMU_LAZY_TLB_REFCOUNT=n. It ensures no CPUs are using an
> +# mm as a lazy tlb beyond its last reference count, by shooting down these
> +# users before the mm is deallocated. __mmdrop() first IPIs all CPUs that may
> +# be using the mm as a lazy tlb, so that they may switch themselves to using
> +# init_mm for their active mm. mm_cpumask(mm) is used to determine which CPUs
> +# may be using mm as a lazy tlb mm.
> +#
> +# To implement this, an arch must ensure mm_cpumask(mm) contains at least all
> +# possible CPUs in which the mm is lazy, and it must meet the requirements for
> +# MMU_LAZY_TLB_REFCOUNT=n (see above).
> +config MMU_LAZY_TLB_SHOOTDOWN
> +	bool
>  
>  config ARCH_HAVE_NMI_SAFE_CMPXCHG
>  	bool
> --- a/kernel/fork.c~lazy-tlb-shoot-lazies-a-non-refcounting-lazy-tlb-option
> +++ a/kernel/fork.c
> @@ -674,6 +674,53 @@ static void check_mm(struct mm_struct *m
>  #define allocate_mm()	(kmem_cache_alloc(mm_cachep, GFP_KERNEL))
>  #define free_mm(mm)	(kmem_cache_free(mm_cachep, (mm)))
>  
> +static void do_shoot_lazy_tlb(void *arg)
> +{
> +	struct mm_struct *mm = arg;
> +
> +	if (current->active_mm == mm) {
> +		WARN_ON_ONCE(current->mm);
> +		current->active_mm = &init_mm;
> +		switch_mm(mm, &init_mm, current);
> +	}
> +}
> +
> +static void do_check_lazy_tlb(void *arg)
> +{
> +	struct mm_struct *mm = arg;
> +
> +	WARN_ON_ONCE(current->active_mm == mm);
> +}
> +
> +static void shoot_lazy_tlbs(struct mm_struct *mm)
> +{
> +	if (IS_ENABLED(CONFIG_MMU_LAZY_TLB_SHOOTDOWN)) {
> +		/*
> +		 * IPI overheads have not been found to be expensive, but they
> +		 * could be reduced in a number of possible ways, for example (in
> +		 * roughly increasing order of complexity):
> +		 * - A batch of mms requiring IPIs could be gathered and freed
> +		 *   at once.
> +		 * - CPUs could store their active mm somewhere that can be
> +		 *   remotely checked without a lock, to filter out
> +		 *   false-positives in the cpumask.
> +		 * - After mm_users or mm_count reaches zero, switching away
> +		 *   from the mm could clear mm_cpumask to reduce some IPIs
> +		 *   (some batching or delaying would help).
> +		 * - A delayed freeing and RCU-like quiescing sequence based on
> +		 *   mm switching to avoid IPIs completely.
> +		 */
> +		on_each_cpu_mask(mm_cpumask(mm), do_shoot_lazy_tlb, (void *)mm, 1);
> +		if (IS_ENABLED(CONFIG_DEBUG_VM))
> +			on_each_cpu(do_check_lazy_tlb, (void *)mm, 1);
> +	} else {
> +		/*
> +		 * In this case, lazy tlb mms are refcounted and would not reach
> +		 * __mmdrop until all CPUs have switched away and mmdrop()ed.
> +		 */
> +	}
> +}
> +
>  /*
>   * Called when the last reference to the mm
>   * is dropped: either by a lazy thread or by
> @@ -683,6 +730,10 @@ void __mmdrop(struct mm_struct *mm)
>  {
>  	BUG_ON(mm == &init_mm);
>  	WARN_ON_ONCE(mm == current->mm);
> +
> +	/* Ensure no CPUs are using this as their lazy tlb mm */
> +	shoot_lazy_tlbs(mm);
> +
>  	WARN_ON_ONCE(mm == current->active_mm);
>  	mm_free_pgd(mm);
>  	destroy_context(mm);
> _
> 
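
The opt-in side is just a Kconfig select plus the mm_cpumask guarantee the
comment above spells out. As an illustration (hypothetical fragment; the
powerpc patch in this series does something along these lines for
Book3S-64, see the actual patch for the exact condition):

	config PPC
		...
		select MMU_LAZY_TLB_SHOOTDOWN	if PPC_BOOK3S_64

An arch that cannot guarantee mm_cpumask() covers every possible lazy user
must not select this and stays on MMU_LAZY_TLB_REFCOUNT=y.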
