Re: [PATCH] sched: Move task_mm_cid_work to mm delayed work

On Wed, 2024-12-11 at 12:07 -0500, Mathieu Desnoyers wrote:
> > Here's where I'm in doubt: is a compact map more desirable than
> > reusing the same mm_cids for cache locality?
> 
> This is a tradeoff between:
> 
> A) Preserving cache locality after a transition from many threads to
>     few threads, or after reducing the hamming weight of the allowed
>     cpu mask.
> 
> B) Making the mm_cid guarantees wrt nr threads and allowed cpu mask
>     easy to document and understand.
> 
> C) Allowing applications to eventually react to mm_cid compaction
>     after reduction of the nr threads or allowed cpu mask, making the
>     tracking of mm_cid compaction easier by shrinking it back towards
>     0 or not.
> 
> D) Making sure applications that periodically reduce and then
>     increase again the nr threads or allowed cpu mask still benefit
>     from good cache locality with mm_cid.
> 
> 
> > If not, should we perhaps ignore the recent_cid if it's larger than
> > the map weight?
> > It seems the only way the recent_cid is unset is with migrations,
> > but I'm not sure if forcing one would make the test pointless, as
> > the cid could be dropped outside of task_mm_cid_work.
> > 
> > What do you think?
> 
> Can you try this patch? (compile-tested only)
> 
> commit 500649e03c5c28443f431829732c580750657326
> Author: Mathieu Desnoyers <mathieu.desnoyers@xxxxxxxxxxxx>
> Date:   Wed Dec 11 11:53:01 2024 -0500
> 
>      sched: shrink mm_cid allocation with nr thread/affinity
> 
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index 76f5f53a645f..b92e79770a93 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -3657,10 +3657,24 @@ static inline int __mm_cid_try_get(struct task_struct *t, struct mm_struct *mm)
>  {
>  	struct cpumask *cidmask = mm_cidmask(mm);
>  	struct mm_cid __percpu *pcpu_cid = mm->pcpu_cid;
> -	int cid = __this_cpu_read(pcpu_cid->recent_cid);
> +	int cid, max_nr_cid, allowed_max_nr_cid;
>  
> +	/*
> +	 * After shrinking the number of threads or reducing the number
> +	 * of allowed cpus, reduce the value of max_nr_cid so expansion
> +	 * of cid allocation will preserve cache locality if the number
> +	 * of threads or allowed cpus increase again.
> +	 */
> +	max_nr_cid = atomic_read(&mm->max_nr_cid);
> +	while ((allowed_max_nr_cid = min_t(int, READ_ONCE(mm->nr_cpus_allowed), atomic_read(&mm->mm_users))),
> +	       max_nr_cid > allowed_max_nr_cid) {
> +		if (atomic_try_cmpxchg(&mm->max_nr_cid, &max_nr_cid, allowed_max_nr_cid))
> +			break;
> +	}
>  	/* Try to re-use recent cid. This improves cache locality. */
> -	if (!mm_cid_is_unset(cid) && !cpumask_test_and_set_cpu(cid, cidmask))
> +	cid = __this_cpu_read(pcpu_cid->recent_cid);
> +	if (!mm_cid_is_unset(cid) && cid < max_nr_cid &&
> +	    !cpumask_test_and_set_cpu(cid, cidmask))
>  		return cid;
>  	/*
>  	 * Expand cid allocation if the maximum number of concurrency
> @@ -3668,12 +3682,11 @@ static inline int __mm_cid_try_get(struct task_struct *t, struct mm_struct *mm)
>  	 * and number of threads. Expanding cid allocation as much as
>  	 * possible improves cache locality.
>  	 */
> -	cid = atomic_read(&mm->max_nr_cid);
> -	while (cid < READ_ONCE(mm->nr_cpus_allowed) && cid < atomic_read(&mm->mm_users)) {
> -		if (!atomic_try_cmpxchg(&mm->max_nr_cid, &cid, cid + 1))
> +	while (max_nr_cid < allowed_max_nr_cid) {
> +		if (!atomic_try_cmpxchg(&mm->max_nr_cid, &max_nr_cid, max_nr_cid + 1))
>  			continue;
> -		if (!cpumask_test_and_set_cpu(cid, cidmask))
> -			return cid;
> +		if (!cpumask_test_and_set_cpu(max_nr_cid, cidmask))
> +			return max_nr_cid;
>  	}
>  	/*
>  	 * Find the first available concurrency id.
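
Just to check my understanding: the new logic boils down to something
like the userspace sketch below. Plain ints stand in for the atomics,
per-cpu reads and cpumask, so it is single-threaded and illustrative
only; all names here are mine, not the kernel's.

/*
 * Minimal single-threaded sketch of the shrink + expand + fallback
 * logic in the patch above. Illustrative only, not kernel code.
 */
#include <stdio.h>

#define CID_UNSET -1

static unsigned long long cidmask; /* bit i set => cid i in use   */
static int max_nr_cid;             /* stand-in for mm->max_nr_cid */

static int min_int(int a, int b) { return a < b ? a : b; }

/* Stand-in for __mm_cid_try_get(); limits passed in for simplicity. */
static int mm_cid_try_get(int recent_cid, int nr_cpus_allowed, int mm_users)
{
	int allowed_max_nr_cid = min_int(nr_cpus_allowed, mm_users);

	/* Shrink max_nr_cid after a thread/affinity reduction. */
	if (max_nr_cid > allowed_max_nr_cid)
		max_nr_cid = allowed_max_nr_cid;

	/* Re-use the recent cid only if below the clamped maximum. */
	if (recent_cid != CID_UNSET && recent_cid < max_nr_cid &&
	    !(cidmask & (1ULL << recent_cid))) {
		cidmask |= 1ULL << recent_cid;
		return recent_cid;
	}

	/* Expand cid allocation up to the allowed maximum. */
	while (max_nr_cid < allowed_max_nr_cid) {
		int cid = max_nr_cid++;

		if (!(cidmask & (1ULL << cid))) {
			cidmask |= 1ULL << cid;
			return cid;
		}
	}

	/* Find the first available concurrency id. */
	for (int cid = 0; cid < allowed_max_nr_cid; cid++) {
		if (!(cidmask & (1ULL << cid))) {
			cidmask |= 1ULL << cid;
			return cid;
		}
	}
	return CID_UNSET;
}

int main(void)
{
	/* 8 threads on 8 cpus: cids 0..7 are handed out in order. */
	for (int i = 0; i < 8; i++)
		printf("cid=%d\n", mm_cid_try_get(CID_UNSET, 8, 8));

	/*
	 * All cids dropped, process shrinks to 2 threads: a recent
	 * cid of 5 is now rejected (5 >= 2) instead of re-used, and
	 * allocation compacts back to cid 0.
	 */
	cidmask = 0;
	printf("after shrink: cid=%d\n", mm_cid_try_get(5, 8, 2));
	return 0;
}

If I read it right, after a shrink a recent_cid above the clamped
maximum is rejected and allocation compacts back towards 0, which is
the behaviour the test should observe.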

Thanks for the patch; it seems much more robust than my simple
condition on the weight. It passes both versions of the test we
previously discussed and doesn't seem to interfere with general rseq
functionality, as checked by the other selftests.
I'm not sure whether I should run more tests on this one.
I will come up with a V2 shortly and attach some performance
evaluations.

Do you want to keep your patch separate, or should I submit it
together with V2?

Thanks,
Gabriele





