On 2024-12-06 03:53, Gabriele Monaco wrote:
On Thu, 2024-12-05 at 11:25 -0500, Mathieu Desnoyers wrote:
[...]
The behaviour imposed by this patch (at least the intended one) is to
run task_mm_cid_work with the configured periodicity (plus scheduling
latency) for each active mm.
What you propose looks like a more robust design than running under
the tick.
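For the record, a minimal sketch of what such a per-mm delayed work
could look like (the cid_work field, mm_cid_work_fn and the 100ms
period are illustrative assumptions, not the actual patch):

#include <linux/workqueue.h>
#include <linux/mm_types.h>
#include <linux/jiffies.h>

#define MM_CID_SCAN_DELAY_MS	100	/* assumed periodicity */

/* Sketch only: cid_work is an assumed new field in struct mm_struct. */
static void mm_cid_work_fn(struct work_struct *work)
{
	struct mm_struct *mm = container_of(to_delayed_work(work),
					    struct mm_struct, cid_work);

	/* ... compact the mm_cid allocation for this mm ... */

	/* Re-arm for as long as the mm stays alive and active. */
	schedule_delayed_work(&mm->cid_work,
			      msecs_to_jiffies(MM_CID_SCAN_DELAY_MS));
}

/* Called once when the mm is created: */
static void mm_cid_work_init(struct mm_struct *mm)
{
	INIT_DELAYED_WORK(&mm->cid_work, mm_cid_work_fn);
	schedule_delayed_work(&mm->cid_work,
			      msecs_to_jiffies(MM_CID_SCAN_DELAY_MS));
}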
This behaviour seems more predictable to me, but would that even be
required for rseq, or is it overkill?
Your approach looks more robust, so I would be tempted to introduce
it as a fix. Is the space/runtime overhead similar between the
tick/task-work approach and yours?
I'm going to fix the implementation and come up with some runtime stats
to compare the overhead of both methods.
As for the space overhead, I think I can answer this question already:
* The current approach uses a callback_head per thread (16 bytes)
* Mine relies on a delayed work per mm (88 bytes)
Tasks with 5 threads or fewer have a lower memory footprint with the
current approach.
I quickly checked some systems I have access to, and I'd say my
approach introduces some memory overhead on an average system, but
considering that a task_struct can be 7-13 kB and an mm_struct is
about 1.4 kB, the overhead should be acceptable.
ok!
In other words, was the tick chosen out of simplicity, or is there
some property that has to be preserved?
Out of simplicity, and "do like what NUMA has done". But I am not
particularly attached to it. :-)
P.S. I ran the rseq selftests on both this and the previous patch
(both broken) and saw no failure.
That's expected, because the tests do not depend much on the
compactness of the mm_cid allocation. The way I validated this
in the past was with a simple multi-threaded program in which
many threads, on a many-core system, periodically print the
current mm_cid from userspace and sleep for a few seconds
between prints.
Then see how it behaves when run: are the mm_cid values close to 0,
or are large mm_cid values allocated without compaction over time?
I have not found a good way to translate this into an automated
test though. Ideas are welcome.
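For instance, the per-thread check can be as small as this sketch,
assuming librseq's rseq_current_mm_cid() helper and an rseq
registration done by a recent glibc (or by librseq itself):

#include <stdio.h>
#include <unistd.h>
#include <rseq/rseq.h>	/* librseq */

int main(void)
{
	/* Each thread of the real test would run this loop. */
	for (int i = 0; i < 10; i++) {
		printf("mm_cid: %d\n", rseq_current_mm_cid());
		sleep(5);
	}
	return 0;
}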
You can look at the librseq basic_test as a starting point. [1]
Perfect, will try those!
Thinking back on this, you'll want a program that does the following
on a system with N CPUs:
- Phase 1: run one thread per CPU, each pinned to its CPU. Print the
mm_cid from each thread, along with the CPU number, every second or so.
- Exit all threads except the main thread, and join them from the main
thread.
- Phase 2: the program is now single-threaded. We'd expect the
mm_cid value to converge towards 0 as the periodic task clears
unused CIDs.
So I think phase 2 can give us an actual automated test: if, after
an order of magnitude more time than the 100ms delay between periodic
tasks, we still observe mm_cid > 0, then something is wrong.
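Something like the following sketch, using pthreads and the same
assumed rseq_current_mm_cid() helper as above (the sleep durations
and the 100ms period are assumptions):

#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <rseq/rseq.h>	/* librseq; per-thread rseq registration assumed
			 * to be handled by glibc or librseq. */

static void *phase1_thread(void *arg)
{
	long cpu = (long)arg;
	cpu_set_t set;

	/* Pin this thread to its assigned CPU. */
	CPU_ZERO(&set);
	CPU_SET(cpu, &set);
	pthread_setaffinity_np(pthread_self(), sizeof(set), &set);

	/* Print the mm_cid with the CPU number, once per second. */
	for (int i = 0; i < 3; i++) {
		printf("cpu %ld: mm_cid %d\n", cpu, rseq_current_mm_cid());
		sleep(1);
	}
	return NULL;
}

int main(void)
{
	long nr_cpus = sysconf(_SC_NPROCESSORS_ONLN);
	pthread_t *tids = calloc(nr_cpus, sizeof(*tids));

	/* Phase 1: one pinned thread per CPU, each printing its mm_cid. */
	for (long i = 0; i < nr_cpus; i++)
		pthread_create(&tids[i], NULL, phase1_thread, (void *)i);
	for (long i = 0; i < nr_cpus; i++)
		pthread_join(tids[i], NULL);

	/*
	 * Phase 2: single-threaded again. Wait an order of magnitude
	 * longer than the 100ms compaction period; the remaining
	 * thread's mm_cid should have converged to 0 by then.
	 */
	sleep(2);
	if (rseq_current_mm_cid() > 0) {
		fprintf(stderr, "FAIL: mm_cid %d after compaction delay\n",
			rseq_current_mm_cid());
		free(tids);
		return EXIT_FAILURE;
	}
	printf("PASS\n");
	free(tids);
	return EXIT_SUCCESS;
}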
Thoughts?
Mathieu
--
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com