On Mon, 2024-12-09 at 10:48 -0500, Mathieu Desnoyers wrote:
> On 2024-12-09 10:33, Mathieu Desnoyers wrote:
> > A small tweak on your proposed approach: in phase 1, get each thread
> > to publish which mm_cid they observe, and select one thread which
> > has observed mm_cid > 1 (possibly the largest mm_cid) as the thread
> > that will keep running in phase 2 (in addition to the main thread).
> >
> > All threads other than the main thread and that selected thread exit
> > and are joined before phase 2.
> >
> > So you end up in phase 2 with:
> >
> > - main (observed any mm_cid)
> > - selected thread (observed mm_cid > 1, possibly largest)
> >
> > Then after a while, the selected thread should observe a
> > mm_cid <= 1.
> >
> > This test should be skipped if there are less than 3 CPUs in
> > allowed cpumask (sched_getaffinity).
>
> Even better:
>
> For a sched_getaffinity with N cpus:
>
> - If N == 1 -> skip (we cannot validate anything)
>
> Phase 1: create N - 1 pthreads, each pinned to a CPU. main thread
> also pinned to a cpu.
>
> Publish the mm_cids observed by each thread, including main thread.
>
> Select a new leader for phase 2: a thread which has observed nonzero
> mm_cid. Each other thread including possibly main thread issue
> pthread_exit, and the new leader does pthread join on each other.
>
> Then check that the new leader eventually observe mm_cid == 0.
>
> And it works with an allowed cpu mask that has only 2 cpus.

Sounds even neater, thanks for the tips, I'll try this last one out!

Coming back to the implementation, I have been trying to validate my
approach with this test, wrapped my head around it, and found out that
the test can't actually pass on the latest upstream.

When an mm_cid is lazily dropped to compact the mask, it is re-assigned
again while switching in. The change introduced in "sched: Improve
cache locality of RSEQ concurrency IDs for intermittent workloads" adds
a recent_cid, and it seems that it is never unset during the test
(nothing migrates).

Now, I'm still running my first version of the test, so I have a thread
running on CPU0 with mm_cid=0 and another running on CPU127 with
mm_cid, say, 127 (weight=2). In practice, the test expects 127 to be
dropped (> 2), but this is not the case since 127 could exhibit better
cache locality, so it is selected again on the next round.

Here's where I'm in doubt: is a compact map more desirable than reusing
the same mm_cids for cache locality? If not, should we perhaps ignore
the recent_cid if it's larger than the map weight?

It seems the only way the recent_cid is unset is with migrations, but
I'm not sure whether forcing one would make the test pointless, as the
cid could then be dropped outside of task_mm_cid_work.

What do you think?

Thanks,
Gabriele
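
---
For reference, a rough, untested sketch of the two-phase structure
described above. It assumes the rseq selftests helpers
rseq_register_current_thread() and rseq_current_mm_cid() (adjust the
names if they differ), and it leaves out the N == 2 corner case where
only the main thread observes a nonzero mm_cid:

/*
 * Sketch of the two-phase mm_cid compaction test (untested).
 */
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdatomic.h>
#include <stdlib.h>
#include <unistd.h>

#include "rseq.h"	/* selftests helpers; assumed to provide rseq_current_mm_cid() */

#define MAX_CPUS	512

static int cpus[MAX_CPUS], nr_cpus;
static pthread_t workers[MAX_CPUS];
static pthread_t main_thread;
static atomic_int observed_cid[MAX_CPUS];
static atomic_int leader = -1;		/* worker index of the phase-2 leader */
static pthread_barrier_t phase1;

static void pin_to(int cpu)
{
	cpu_set_t set;

	CPU_ZERO(&set);
	CPU_SET(cpu, &set);
	if (pthread_setaffinity_np(pthread_self(), sizeof(set), &set))
		exit(1);
}

static void *worker_fn(void *arg)
{
	int idx = (int)(long)arg;

	pin_to(cpus[idx]);
	rseq_register_current_thread();

	/* Phase 1: publish the mm_cid this thread observes. */
	atomic_store(&observed_cid[idx], rseq_current_mm_cid());
	pthread_barrier_wait(&phase1);

	/* Wait for the main thread to pick the leader. */
	while (atomic_load(&leader) < 0)
		usleep(1000);
	if (atomic_load(&leader) != idx)
		return NULL;		/* non-leader: exit so only the leader remains */

	/* Phase 2 (leader): join every other thread, including main. */
	for (int i = 0; i < nr_cpus - 1; i++)
		if (i != idx)
			pthread_join(workers[i], NULL);
	pthread_join(main_thread, NULL);	/* main called pthread_exit() below */

	/* With a single thread left, mm_cid should be compacted back to 0. */
	for (int i = 0; i < 200; i++) {		/* ~20s upper bound */
		if (rseq_current_mm_cid() == 0)
			exit(0);		/* PASS */
		usleep(100000);
	}
	exit(1);				/* FAIL: mm_cid never dropped back to 0 */
}

int main(void)
{
	cpu_set_t set;

	if (sched_getaffinity(0, sizeof(set), &set))
		return 1;
	for (int c = 0; c < CPU_SETSIZE && nr_cpus < MAX_CPUS; c++)
		if (CPU_ISSET(c, &set))
			cpus[nr_cpus++] = c;
	if (nr_cpus < 2)
		return 4;			/* skip: nothing to validate with one CPU */

	main_thread = pthread_self();
	pin_to(cpus[nr_cpus - 1]);		/* main takes the last allowed CPU */
	pthread_barrier_init(&phase1, NULL, nr_cpus);

	/* Phase 1: N - 1 workers, each pinned to one allowed CPU. */
	for (int i = 0; i < nr_cpus - 1; i++)
		pthread_create(&workers[i], NULL, worker_fn, (void *)(long)i);
	pthread_barrier_wait(&phase1);

	/* Pick as leader a worker that observed a nonzero mm_cid. */
	for (int i = 0; i < nr_cpus - 1; i++) {
		if (atomic_load(&observed_cid[i]) > 0) {
			atomic_store(&leader, i);
			break;
		}
	}
	if (atomic_load(&leader) < 0)
		exit(4);	/* only main saw a nonzero cid (N == 2): not handled here */

	pthread_exit(NULL);	/* main exits too; the leader joins it */
}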