On Thu, 2024-12-05 at 11:25 -0500, Mathieu Desnoyers wrote:
> On 2024-12-05 09:33, Gabriele Monaco wrote:
> > Before sending a V2, however, I'd like to get some more insights
> > about the requirements of this function.
> > 
> > The current behaviour upstream is to call task_mm_cid_work for the
> > task running after the scheduler tick. The function checks that we
> > don't run too often for the same mm, but it seems possible that
> > some process with a short runtime would rarely run during the tick.
> 
> So your concern is about a mm with threads running in short bursts,
> and those would only rarely happen to be running while the tick
> interrupt is triggered. We may indeed be missing something here,
> because the goal is to ensure that we periodically do the
> task_mm_cid_work for each mm.
> 
> The side effect of missing this work is that the mm_cid allocation
> cpumask is not compacted. It won't cause rseq to fail per se, but it
> will cause the mm_cid allocation to be less compact than it should
> be.

Yes, that was exactly the case: tasks like timerlat/cyclictest running
periodically but doing very short work. Makes sense, now it's much
clearer.

> > The behaviour imposed by this patch (at least the intended one) is
> > to run the task_mm_cid_work with the configured periodicity (plus
> > scheduling latency) for each active mm.
> 
> What you propose looks like a more robust design than running under
> the tick.
> 
> > This behaviour seems more predictable to me, but would that even be
> > required for rseq, or is it just overkill?
> 
> Your approach looks more robust, so I would be tempted to introduce
> it as a fix. Is the space/runtime overhead similar between the
> tick/task work approach and yours?

I'm going to fix the implementation and come up with some runtime
stats to compare the overhead of both methods.

As for the space overhead, I think I can answer that question already:

* The current approach uses a callback_head per thread (16 bytes)
* Mine relies on a delayed work per mm (88 bytes)

Tasks with 5 threads or fewer have a lower memory footprint with the
current approach. I checked quickly on some systems I have access to,
and I'd say my approach introduces some memory overhead on an average
system, but considering a task_struct can be 7-13 kB and an mm_struct
is about 1.4 kB, the overhead should be acceptable.

> > In other words, was the tick chosen out of simplicity, or is there
> > some property that has to be preserved?
> 
> Out of simplicity, and "do like what NUMA has done". But I am not
> particularly attached to it. :-)
> 
> > P.S. I ran the rseq self tests on both this and the previous patch
> > (both broken) and saw no failures.
> 
> That's expected, because the tests do not depend much on the
> compactness of the mm_cid allocation. The way I validated this in
> the past is by creating a simple multi-threaded program that
> periodically prints the current mm_cid from userspace and sleeps for
> a few seconds between prints, from many threads on a many-core
> system.
> 
> Then see how it behaves when run: are the mm_cid values close to 0,
> or are there large values of mm_cid allocated without compaction
> over time? I have not found a good way to translate this into an
> automated test though. Ideas are welcome.
> 
> You can look at the librseq basic_test as a starting point. [1]

Perfect, will try those!

Thanks,
Gabriele
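---
For reference, a minimal sketch of the kind of checker described above:
spawn many threads, have each one periodically print its mm_cid with a
few seconds of sleep in between, and watch whether the values stay
close to 0 over time. It assumes librseq is installed and that its
rseq_register_current_thread(), rseq_mm_cid_available() and
rseq_current_mm_cid() helpers behave as their names suggest (see the
basic_test mentioned above for the actual API); treat it as a sketch,
not a drop-in test.

/* Build with something like: gcc -O2 -o mm_cid_check mm_cid_check.c -lrseq -lpthread */
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <rseq/rseq.h>

#define NR_THREADS	64
#define NR_ROUNDS	10

static void *thread_fn(void *arg)
{
	long id = (long) arg;
	int round;

	/* Assumed to be a no-op if rseq is already registered (e.g. by glibc). */
	if (rseq_register_current_thread())
		abort();

	for (round = 0; round < NR_ROUNDS; round++) {
		/*
		 * If compaction works, mm_cid should stay below the number of
		 * concurrently running threads, i.e. remain close to 0 here.
		 */
		printf("thread %3ld round %2d mm_cid %d\n",
		       id, round, rseq_current_mm_cid());
		sleep(3);	/* short burst of work, then sleep a few seconds */
	}
	return NULL;
}

int main(void)
{
	pthread_t threads[NR_THREADS];
	long i;

	if (!rseq_mm_cid_available()) {
		fprintf(stderr, "mm_cid not available on this kernel\n");
		return EXIT_FAILURE;
	}
	for (i = 0; i < NR_THREADS; i++)
		if (pthread_create(&threads[i], NULL, thread_fn, (void *) i))
			abort();
	for (i = 0; i < NR_THREADS; i++)
		pthread_join(threads[i], NULL);
	return EXIT_SUCCESS;
}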