On 2024-12-05 09:33, Gabriele Monaco wrote:
The patch is fundamentally broken since I somehow lost the line calling schedule_delayed_work in task_mm_cid_work to re-schedule itself.
Yes, I was puzzled about it when reviewing your patch.
Before sending a V2, however, I'd like to get some more insights about the requirements of this function. The current behaviour upstream is to call task_mm_cid_work for the task running after the scheduler tick. The function checks that we don't run too often for the same mm, but it seems possible that some process with short runtime would rarely run during the tick.
So your concern is about a mm with threads running in short bursts, and those would happen to rarely run while the tick interrupt is triggered. We may indeed be missing something here, because the goal is to ensure that we periodically do the task_mm_cid_work for each mm. The side-effect of missing this work is not compacting the mm_cid allocation cpumask. It won't cause rseq to fail per se, but it will cause the mm_cid allocation to be less compact than it should be.
The behaviour imposed by this patch (at least the intended one) is to run the task_mm_cid_work with the configured periodicity (plus scheduling latency) for each active mm.
What you propose looks like a more robust design than running under the tick.
This behaviour seem to me more predictable, but would that even be required for rseq or is it just an overkill?
Your approach looks more robust, so I would be tempted to introduce it as a fix. Is the space/runtime overhead similar between the tick/task work approach vs yours ?
In other words, was the tick chosen out of simplicity or is there some property that has to be preserved?
Out of simplicity, and "do like what NUMA has done". But I am not particularly attached to it. :-)
P.S. I run the rseq self tests on both this and the previous patch (both broken) and saw no failure.
That's expected, because the tests do not so much depend on the compactness of the mm_cid allocation. They way I validated this in the past is by creating a simple multi-threaded program that periodically prints the current mm_cid from userspace, and sleep for a few seconds between printing, from many threads on a many-core system. Then see how it reacts when run: are the mm_cid close to 0, or are there large values of mm_cid allocated without compaction over time ? I have not found a good way to translate this into an automated test though. Ideas are welcome. You can look at the librseq basic_test as a starting point. [1] Thanks, Mathieu [1] https://github.com/compudj/librseq/blob/master/tests/basic_test.c
Thanks, Gabriele
-- Mathieu Desnoyers EfficiOS Inc. https://www.efficios.com