On 2024-12-11 07:27, Gabriele Monaco wrote:
On Mon, 2024-12-09 at 10:48 -0500, Mathieu Desnoyers wrote:
On 2024-12-09 10:33, Mathieu Desnoyers wrote:
A small tweak on your proposed approach: in phase 1, get each thread to publish
which mm_cid it observes, and select one thread which has observed an mm_cid > 1
(possibly the largest mm_cid) as the thread that will keep running in phase 2
(in addition to the main thread). All threads other than the main thread and
that selected thread exit and are joined before phase 2.
So you end up in phase 2 with:
- main (observed any mm_cid)
- selected thread (observed mm_cid > 1, possibly largest)
Then, after a while, the selected thread should observe an mm_cid <= 1.
This test should be skipped if there are fewer than 3 CPUs in the allowed
cpumask (sched_getaffinity).
Even better:
For a sched_getaffinity with N cpus:
- If N == 1 -> skip (we cannot validate anything)
Phase 1: create N - 1 pthreads, each pinned to a CPU. The main thread is also
pinned to a CPU.
Publish the mm_cids observed by each thread, including the main thread.
Select a new leader for phase 2: a thread which has observed a nonzero mm_cid.
Every other thread, possibly including the main thread, issues pthread_exit,
and the new leader joins each of them.
Then check that the new leader eventually observes mm_cid == 0.
And it works with an allowed cpu mask that has only 2 cpus.
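Something along these lines, as a rough sketch rather than the actual rseq
selftest: it assumes glibc >= 2.35 (which registers rseq and exports
__rseq_offset/__rseq_size), hard-codes the linux/rseq.h mm_cid offset (24 bytes
into the rseq area), and omits most error handling:

/* Rough sketch of the two-phase test above; not the actual rseq selftest. */
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/rseq.h>                /* __rseq_offset, __rseq_size */

#define MM_CID_OFFSET   24           /* offsetof(struct rseq, mm_cid) */

enum role { ROLE_UNDECIDED, ROLE_EXIT, ROLE_LEADER };

struct worker {
        pthread_t tid;
        int cpu;
        _Atomic unsigned int cid;    /* mm_cid published at the end of phase 1 */
        _Atomic enum role role;
};

static struct worker workers[CPU_SETSIZE];
static int nr_workers;
static pthread_t main_tid;

static unsigned int read_mm_cid(void)
{
        unsigned int *p = (unsigned int *)((char *)__builtin_thread_pointer() +
                                           __rseq_offset + MM_CID_OFFSET);
        return __atomic_load_n(p, __ATOMIC_RELAXED);
}

static void pin_to_cpu(int cpu)
{
        cpu_set_t set;

        CPU_ZERO(&set);
        CPU_SET(cpu, &set);
        pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
}

/* Phase 2: join every other thread, then wait for mm_cid to shrink to 0. */
static void leader_phase2(int self_idx)
{
        for (int i = 0; i < nr_workers; i++)
                if (i != self_idx)
                        pthread_join(workers[i].tid, NULL);
        if (self_idx >= 0)           /* a worker leads: also join the main thread */
                pthread_join(main_tid, NULL);
        for (int sec = 0; sec < 10; sec++) {   /* give task_mm_cid_work time to run */
                if (read_mm_cid() == 0) {
                        printf("ok: mm_cid compacted back to 0\n");
                        exit(0);
                }
                sleep(1);
        }
        printf("fail: mm_cid is still %u\n", read_mm_cid());
        exit(1);
}

static void *worker_fn(void *arg)
{
        struct worker *w = arg;

        pin_to_cpu(w->cpu);
        w->cid = read_mm_cid();      /* phase 1: publish what this thread sees */
        while (w->role == ROLE_UNDECIDED)
                usleep(1000);
        if (w->role == ROLE_LEADER)
                leader_phase2((int)(w - workers));
        return NULL;                 /* ROLE_EXIT */
}

int main(void)
{
        int cpus[CPU_SETSIZE], n = 0, leader = -1;
        cpu_set_t set;

        main_tid = pthread_self();
        sched_getaffinity(0, sizeof(set), &set);
        for (int c = 0; c < CPU_SETSIZE; c++)
                if (CPU_ISSET(c, &set))
                        cpus[n++] = c;
        if (n < 2 || !__rseq_size)
                return 4;            /* skip: single CPU or rseq not registered */
        pin_to_cpu(cpus[0]);
        nr_workers = n - 1;
        for (int i = 0; i < nr_workers; i++) {
                workers[i].cpu = cpus[i + 1];
                pthread_create(&workers[i].tid, NULL, worker_fn, &workers[i]);
        }
        sleep(1);                    /* crude: let phase 1 observations land */
        for (int i = 0; i < nr_workers; i++)
                if (workers[i].cid != 0)
                        leader = i;  /* select a thread with a nonzero mm_cid */
        for (int i = 0; i < nr_workers; i++)
                workers[i].role = (i == leader) ? ROLE_LEADER : ROLE_EXIT;
        if (leader >= 0)
                pthread_exit(NULL);  /* the new leader will join us as well */
        leader_phase2(-1);           /* no nonzero cid seen: main stays leader */
}

The 10-second polling window in leader_phase2() is just an arbitrary bound on
how long the periodic task_mm_cid_work compaction may take.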
Sounds even neater, thanks for the tips, I'll try this last one out!
Coming back to the implementation, I have been trying to validate my
approach with this test, wrapped my head around it, and found out that
the test can't actually pass on the latest upstream.
When an mm_cid is lazily dropped to compact the mask, it gets re-assigned again
when the task switches in.
The change introduced in "sched: Improve cache locality of RSEQ concurrency IDs
for intermittent workloads" adds a recent_cid, and it seems it is never unset
during the test (nothing migrates).
Good point!
Now, I'm still running my first version of the test, so I have a thread
running on CPU0 with mm_cid=0 and another running on CPU127 with
mm_cid, say, 127 (weight=2).
In practice, the test expects 127 to be dropped (> 2), but this is not the case
since 127 could exhibit better cache locality, so it is selected again on the
next round.
Understood.
Here's where I'm in doubt: is a compact map more desirable than reusing the
same mm_cids for cache locality?
This is a tradeoff between:
A) Preserving cache locality after a transition from many threads to few
threads, or after reducing the hamming weight of the allowed cpu mask.
B) Making the mm_cid guarantees wrt nr threads and allowed cpu mask easy
to document and understand.
C) Allowing applications to eventually react to mm_cid compaction after
reduction of the nr threads or allowed cpu mask, making the tracking
of mm_cid compaction easier by shrinking it back towards 0 or not.
D) Making sure applications that periodically reduce and then increase
again the nr threads or allowed cpu mask still benefit from good
cache locality with mm_cid.
If not, should we perhaps ignore the recent_cid if it's larger than the
map weight?
It seems the only way the recent_cid gets unset is via migrations, but I'm not
sure whether forcing one would defeat the purpose of the test, as the cid could
be dropped outside of task_mm_cid_work.
What do you think?
Can you try this patch? (compile-tested only)
commit 500649e03c5c28443f431829732c580750657326
Author: Mathieu Desnoyers <mathieu.desnoyers@xxxxxxxxxxxx>
Date: Wed Dec 11 11:53:01 2024 -0500
sched: shrink mm_cid allocation with nr thread/affinity
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 76f5f53a645f..b92e79770a93 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -3657,10 +3657,24 @@ static inline int __mm_cid_try_get(struct task_struct *t, struct mm_struct *mm)
 {
         struct cpumask *cidmask = mm_cidmask(mm);
         struct mm_cid __percpu *pcpu_cid = mm->pcpu_cid;
-        int cid = __this_cpu_read(pcpu_cid->recent_cid);
+        int cid, max_nr_cid, allowed_max_nr_cid;
 
+        /*
+         * After shrinking the number of threads or reducing the number
+         * of allowed cpus, reduce the value of max_nr_cid so expansion
+         * of cid allocation will preserve cache locality if the number
+         * of threads or allowed cpus increase again.
+         */
+        max_nr_cid = atomic_read(&mm->max_nr_cid);
+        while ((allowed_max_nr_cid = min_t(int, READ_ONCE(mm->nr_cpus_allowed), atomic_read(&mm->mm_users))),
+               max_nr_cid > allowed_max_nr_cid) {
+                if (atomic_try_cmpxchg(&mm->max_nr_cid, &max_nr_cid, allowed_max_nr_cid))
+                        break;
+        }
         /* Try to re-use recent cid. This improves cache locality. */
-        if (!mm_cid_is_unset(cid) && !cpumask_test_and_set_cpu(cid, cidmask))
+        cid = __this_cpu_read(pcpu_cid->recent_cid);
+        if (!mm_cid_is_unset(cid) && cid < max_nr_cid &&
+            !cpumask_test_and_set_cpu(cid, cidmask))
                 return cid;
         /*
          * Expand cid allocation if the maximum number of concurrency
@@ -3668,12 +3682,11 @@ static inline int __mm_cid_try_get(struct task_struct *t, struct mm_struct *mm)
          * and number of threads. Expanding cid allocation as much as
          * possible improves cache locality.
          */
-        cid = atomic_read(&mm->max_nr_cid);
-        while (cid < READ_ONCE(mm->nr_cpus_allowed) && cid < atomic_read(&mm->mm_users)) {
-                if (!atomic_try_cmpxchg(&mm->max_nr_cid, &cid, cid + 1))
+        while (max_nr_cid < allowed_max_nr_cid) {
+                if (!atomic_try_cmpxchg(&mm->max_nr_cid, &max_nr_cid, max_nr_cid + 1))
                         continue;
-                if (!cpumask_test_and_set_cpu(cid, cidmask))
-                        return cid;
+                if (!cpumask_test_and_set_cpu(max_nr_cid, cidmask))
+                        return max_nr_cid;
         }
         /*
          * Find the first available concurrency id.
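
As an illustration only (a user-space analogue, not part of the patch): the idea
is to clamp mm->max_nr_cid down to min(nr_cpus_allowed, mm_users) and to only
re-use a recent_cid when it is below that bound, so in the scenario above a
recent_cid of 127 stops being re-used once only 2 threads remain. A sketch of
the clamp, with hypothetical plain atomic_int/int stand-ins for the mm fields:

#include <stdatomic.h>

/*
 * User-space analogue of the shrink loop in the patch (illustration only).
 * Returns the bound that a recent cid must stay below to be re-used.
 */
static int shrink_max_nr_cid(atomic_int *max_nr_cid, int nr_cpus_allowed, int mm_users)
{
        int allowed_max = nr_cpus_allowed < mm_users ? nr_cpus_allowed : mm_users;
        int cur = atomic_load(max_nr_cid);

        while (cur > allowed_max) {
                /* On failure, cur is reloaded, like atomic_try_cmpxchg(). */
                if (atomic_compare_exchange_weak(max_nr_cid, &cur, allowed_max))
                        break;
        }
        /*
         * Example: nr_cpus_allowed = 128, mm_users = 2 -> allowed_max = 2,
         * so a recent cid of 127 no longer passes a "cid < bound" check.
         */
        return cur < allowed_max ? cur : allowed_max;
}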
--
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com