On 2024-12-12 06:09, Gabriele Monaco wrote:
On Wed, 2024-12-11 at 12:07 -0500, Mathieu Desnoyers wrote:
Here's where I'm in doubt: is a compact map more desirable than reusing the same mm_cids for cache locality?

This is a tradeoff between:

A) Preserving cache locality after a transition from many threads to few threads, or after reducing the Hamming weight of the allowed cpu mask.

B) Making the mm_cid guarantees wrt nr threads and allowed cpu mask easy to document and understand.

C) Allowing applications to eventually react to mm_cid compaction after reduction of the nr threads or allowed cpu mask, making the tracking of mm_cid compaction easier by shrinking it back towards 0 or not.

D) Making sure applications that periodically reduce and then increase again the nr threads or allowed cpu mask still benefit from good cache locality with mm_cid.
If not, should we perhaps ignore the recent_cid if it's larger than the map weight?
It seems the only way the recent_cid is unset is with migrations, but I'm not sure if forcing one would defeat the purpose of the test, as the cid could be dropped outside of task_mm_cid_work.
What do you think?
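(For concreteness, the weight-based guard floated above would amount to something like the following inside __mm_cid_try_get(); this is a hypothetical sketch only, reusing mm->nr_cpus_allowed as the tracked weight of the allowed cpu mask, and not what was ultimately applied:)

	/*
	 * Hypothetical sketch: skip re-using the recent cid when it is
	 * not below the weight of the allowed cpu mask, so a compacted
	 * cid map is never widened by a stale recent_cid.
	 */
	cid = __this_cpu_read(pcpu_cid->recent_cid);
	if (!mm_cid_is_unset(cid) &&
	    cid < READ_ONCE(mm->nr_cpus_allowed) &&
	    !cpumask_test_and_set_cpu(cid, cidmask))
		return cid;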
Can you try this patch? (compile-tested only)
commit 500649e03c5c28443f431829732c580750657326
Author: Mathieu Desnoyers <mathieu.desnoyers@xxxxxxxxxxxx>
Date:   Wed Dec 11 11:53:01 2024 -0500

    sched: shrink mm_cid allocation with nr thread/affinity

diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 76f5f53a645f..b92e79770a93 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -3657,10 +3657,24 @@ static inline int __mm_cid_try_get(struct task_struct *t, struct mm_struct *mm)
 {
 	struct cpumask *cidmask = mm_cidmask(mm);
 	struct mm_cid __percpu *pcpu_cid = mm->pcpu_cid;
-	int cid = __this_cpu_read(pcpu_cid->recent_cid);
+	int cid, max_nr_cid, allowed_max_nr_cid;
 
+	/*
+	 * After shrinking the number of threads or reducing the number
+	 * of allowed cpus, reduce the value of max_nr_cid so expansion
+	 * of cid allocation will preserve cache locality if the number
+	 * of threads or allowed cpus increase again.
+	 */
+	max_nr_cid = atomic_read(&mm->max_nr_cid);
+	while ((allowed_max_nr_cid = min_t(int, READ_ONCE(mm->nr_cpus_allowed), atomic_read(&mm->mm_users))),
+	       max_nr_cid > allowed_max_nr_cid) {
+		if (atomic_try_cmpxchg(&mm->max_nr_cid, &max_nr_cid, allowed_max_nr_cid))
+			break;
+	}
 	/* Try to re-use recent cid. This improves cache locality. */
-	if (!mm_cid_is_unset(cid) && !cpumask_test_and_set_cpu(cid, cidmask))
+	cid = __this_cpu_read(pcpu_cid->recent_cid);
+	if (!mm_cid_is_unset(cid) && cid < max_nr_cid &&
+	    !cpumask_test_and_set_cpu(cid, cidmask))
 		return cid;
 	/*
 	 * Expand cid allocation if the maximum number of concurrency
@@ -3668,12 +3682,11 @@ static inline int __mm_cid_try_get(struct task_struct *t, struct mm_struct *mm)
 	 * and number of threads. Expanding cid allocation as much as
 	 * possible improves cache locality.
 	 */
-	cid = atomic_read(&mm->max_nr_cid);
-	while (cid < READ_ONCE(mm->nr_cpus_allowed) && cid < atomic_read(&mm->mm_users)) {
-		if (!atomic_try_cmpxchg(&mm->max_nr_cid, &cid, cid + 1))
+	while (max_nr_cid < allowed_max_nr_cid) {
+		if (!atomic_try_cmpxchg(&mm->max_nr_cid, &max_nr_cid, max_nr_cid + 1))
 			continue;
-		if (!cpumask_test_and_set_cpu(cid, cidmask))
-			return cid;
+		if (!cpumask_test_and_set_cpu(max_nr_cid, cidmask))
+			return max_nr_cid;
 	}
 	/*
 	 * Find the first available concurrency id.
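(Aside: the shrink step above is a plain compare-exchange clamp. A self-contained userspace sketch of the same pattern, using C11 atomics instead of the kernel's atomic_t, purely for illustration; names mirror the patch, and unlike the kernel loop the allowed maximum here is a fixed parameter rather than being recomputed each iteration:)

#include <stdatomic.h>
#include <stdio.h>

/*
 * Illustrative analogue of the patch's shrink loop: clamp max_nr_cid
 * down to the currently allowed maximum with a CAS loop, so a
 * concurrent updater cannot lose the reduction.
 */
static void clamp_max_nr_cid(atomic_int *max_nr_cid, int allowed_max)
{
	int cur = atomic_load(max_nr_cid);

	while (cur > allowed_max) {
		/* On failure, cur is reloaded with the latest value. */
		if (atomic_compare_exchange_weak(max_nr_cid, &cur, allowed_max))
			break;
	}
}

int main(void)
{
	atomic_int max_nr_cid = 8;

	/* e.g. the number of threads dropped from 8 to 4 */
	clamp_max_nr_cid(&max_nr_cid, 4);
	printf("max_nr_cid = %d\n", atomic_load(&max_nr_cid));
	return 0;
}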
Thanks for the patch, it seems much more robust than my simple
condition on the weight. It passes the test (both versions) we
previously discussed and doesn't seem to interfere with the general
rseq functionality as checked by the other selftests.
Great!
I'm not sure if I should run more tests on this one.
The other thing I'd be interested in is to see if those
patches introduce any performance regression (e.g. the
will-it-scale tests).
If you have spare cycles to try this out, that would be good,
else we can let the test bots complain.
I will come up with a V2 shortly and attach some performance
evaluations.
OK
Do you want to keep your patch separate, or should I submit it together with V2?
Let me prepare a proper patch with commit message, and then feel
free to add it into your series, so it benefits from your testing.
Thanks,
Mathieu
Thanks,
Gabriele
--
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com