Re: [PATCH] sched: Move task_mm_cid_work to mm delayed work

Mathieu Desnoyers <mathieu.desnoyers@xxxxxxxxxxxx> · Thu, 5 Dec 2024 11:25:34 -0500

On 2024-12-05 09:33, Gabriele Monaco wrote:
The patch is fundamentally broken since I somehow lost the line calling
schedule_delayed_work in task_mm_cid_work to re-schedule itself.

Yes, I was puzzled about it when reviewing your patch.

Before sending a V2, however, I'd like to get some more insights about
the requirements of this function.

The current behaviour upstream is to call task_mm_cid_work for the task
running after the scheduler tick. The function checks that we don't run
too often for the same mm, but it seems possible that some process with
short runtime would rarely run during the tick.

So your concern is about a mm with threads running in short bursts,
and those would happen to rarely run while the tick interrupt is
triggered. We may indeed be missing something here, because the goal
is to ensure that we periodically do the task_mm_cid_work for each mm.

The side-effect of missing this work is not compacting the
mm_cid allocation cpumask. It won't cause rseq to fail per se,
but it will cause the mm_cid allocation to be less compact than
it should be.

The behaviour imposed by this patch (at least the intended one) is to
run the task_mm_cid_work with the configured periodicity (plus
scheduling latency) for each active mm.

What you propose looks like a more robust design than running under
the tick.

This behaviour seem to me more predictable, but would that even be
required for rseq or is it just an overkill?

Your approach looks more robust, so I would be tempted to introduce
it as a fix. Is the space/runtime overhead similar between the
tick/task work approach vs yours ?

In other words, was the tick chosen out of simplicity or is there some
property that has to be preserved?

Out of simplicity, and "do like what NUMA has done". But I am not
particularly attached to it. :-)

P.S. I run the rseq self tests on both this and the previous patch
(both broken) and saw no failure.

That's expected, because the tests do not so much depend on the
compactness of the mm_cid allocation. They way I validated this
in the past is by creating a simple multi-threaded program that
periodically prints the current mm_cid from userspace, and
sleep for a few seconds between printing, from many threads on
a many-core system.

Then see how it reacts when run: are the mm_cid close to 0, or
are there large values of mm_cid allocated without compaction
over time ? I have not found a good way to translate this into
an automated test though. Ideas are welcome.

You can look at the librseq basic_test as a starting point. [1]

Thanks,

Mathieu

[1] https://github.com/compudj/librseq/blob/master/tests/basic_test.c

Thanks,
Gabriele

--
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com