On Wed, 2025-02-19 at 12:08 -0500, Mathieu Desnoyers wrote:
> On 2025-02-19 11:32, Gabriele Monaco wrote:
> > 
> > On Wed, 2025-02-19 at 10:13 -0500, Mathieu Desnoyers wrote:
> > > > On 2025-02-19 06:31, Gabriele Monaco wrote:
> > > > > > Currently, the task_mm_cid_work function is called in a task
> > > > > > work triggered by a scheduler tick to frequently compact the
> > > > > > mm_cids of each process. This can delay the execution of the
> > > > > > corresponding thread for the entire duration of the function,
> > > > > > negatively affecting the response time of real-time tasks.
> > > > > > In practice, we observe task_mm_cid_work increasing the
> > > > > > latency by 30-35us on a 128-core system; this order of
> > > > > > magnitude is meaningful under PREEMPT_RT.
> > > > > > 
> > > > > > Run the task_mm_cid_work in a new work_struct connected to
> > > > > > the mm_struct rather than in the task context before
> > > > > > returning to userspace.
> > > > > > 
> > > > > > This work_struct is initialised with the mm and disabled
> > > > > > before freeing it. Its execution is no longer triggered by
> > > > > > scheduler ticks: the queuing of the work happens while
> > > > > > returning to userspace in __rseq_handle_notify_resume,
> > > > > > maintaining the checks to avoid running more frequently than
> > > > > > MM_CID_SCAN_DELAY.
> > > > > > 
> > > > > > The main advantage of this change is that the function can
> > > > > > be offloaded to a different CPU and even preempted by RT
> > > > > > tasks.
> > > > > > 
> > > > > > Moreover, this new behaviour is more predictable with
> > > > > > periodic tasks with short runtime, which may rarely run
> > > > > > during a scheduler tick. Now, the work is always scheduled
> > > > > > when the task returns to userspace.
> > > > > > 
> > > > > > The work is disabled during mmdrop: since the function
> > > > > > cannot sleep in all kernel configurations, we cannot wait
> > > > > > for possibly running work items to terminate. We make sure
> > > > > > the mm is valid in case the task is terminating by reserving
> > > > > > it with mmgrab/mmdrop, returning prematurely if we are
> > > > > > really the last user before mmgrab.
> > > > > > This situation is unlikely since we don't schedule the work
> > > > > > for exiting tasks, but we cannot rule it out.
> > > > > > 
> > > > > > Fixes: 223baf9d17f2 ("sched: Fix performance regression introduced by mm_cid")
> > > > > > Signed-off-by: Gabriele Monaco <gmonaco@xxxxxxxxxx>
> > > > > > ---
> > > > > > diff --git a/kernel/rseq.c b/kernel/rseq.c
> > > > > > index 442aba29bc4cf..f8394ebbb6f4d 100644
> > > > > > --- a/kernel/rseq.c
> > > > > > +++ b/kernel/rseq.c
> > > > > > @@ -419,6 +419,7 @@ void __rseq_handle_notify_resume(struct ksignal *ksig, struct pt_regs *regs)
> > > > > >  	}
> > > > > >  	if (unlikely(rseq_update_cpu_node_id(t)))
> > > > > >  		goto error;
> > > > > > +	task_queue_mm_cid(t);
> > > > > >  	return;
> > > > > > 
> > > > > >  error:
> > > > > > diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> > > > > > index 9aecd914ac691..ee35f9962444b 100644
> > > > > > --- a/kernel/sched/core.c
> > > > > > +++ b/kernel/sched/core.c
> > > > > > @@ -5663,7 +5663,6 @@ void sched_tick(void)
> > > > > >  	resched_latency = cpu_resched_latency(rq);
> > > > > >  	calc_global_load_tick(rq);
> > > > > >  	sched_core_tick(rq);
> > > > > > -	task_tick_mm_cid(rq, donor);
> > > > 
> > > > I agree that this approach is promising, however I am concerned
> > > > about the fact that a task running alone on its runqueue (thus
> > > > without preemption) for a long time will never recompact mm_cid,
> > > > and also will never update its mm_cid field.
> > > > 
> > > > So I am tempted to insert this in the sched_tick to cover that
> > > > scenario:
> > > > 
> > > > rseq_preempt(current);
> > > > 
> > > > This would ensure that the task runs __rseq_handle_notify_resume()
> > > > at least each tick.
> > > > 
> > 
> > Right, I thought about this scenario but forgot to add it in the
> > final patch.
> > We could have a test doing that: instead of sleeping, the task busy
> > waits.
> > 
> > Does __rseq_handle_notify_resume need to run in this case, besides
> > the cid compaction, I mean? Otherwise we could again just enqueue
> > the work from there.
> 
> Yes we need to do both:
> 
> - compact cid,
> - run __rseq_handle_notify_resume to update the mm_cid.
> 
> We don't care much if compacting the cid is done at some point and
> __rseq_handle_notify_resume only updates the mm_cid field on the
> following tick.
> 
> So enqueuing the work is not sufficient there, I would really issue
> rseq_preempt(current), which makes sure a busy thread both triggers
> cid compaction *and* gets its mm_cid updated.
> 

Sure, will do.

I've been trying to test this scenario but it's quite hard to achieve: I
set all threads to FIFO at the highest priority, removed all system
calls from the leader thread (even the ones to wait for other threads)
and replaced the usleep with a busy wait; I'm running on a VM, so I'm
not sure if interrupts get in the way. The test still passes...

Anyway, it seems a reasonable situation to happen and I guess it won't
hurt to just run an rseq_preempt in the tick. Testing and sending V8
without touching the selftest. I'm adding a couple of rough sketches of
what I mean below, for reference.

Thanks,
Gabriele
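
For the queuing side described in the changelog, this is roughly what I
have in mind. It is only a sketch: the mm->cid_work field name and the
exact guards here are placeholders, the actual patch may differ.

/*
 * Sketch of task_queue_mm_cid(), called from __rseq_handle_notify_resume()
 * on return to userspace. mm->cid_work is an illustrative name for the
 * work_struct added to mm_struct, not necessarily what the patch uses.
 */
void task_queue_mm_cid(struct task_struct *curr)
{
	struct mm_struct *mm = curr->mm;

	/* Don't queue the scan for kernel threads or exiting tasks. */
	if (!mm || (curr->flags & (PF_EXITING | PF_KTHREAD)))
		return;

	/* Keep the MM_CID_SCAN_DELAY rate limiting of the old tick path. */
	if (time_before(jiffies, READ_ONCE(mm->mm_cid_next_scan)))
		return;

	/* The work item compacts the mm_cids and advances mm_cid_next_scan. */
	schedule_work(&mm->cid_work);
}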
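
And for the tick side in V8, something along these lines (again just a
sketch, not the final hunk; whether it should use current or the donor
task, or be further rate limited, is still open):

/* kernel/sched/core.c, in sched_tick() -- sketch of the V8 idea */
	sched_core_tick(rq);
	/*
	 * No more task_tick_mm_cid() here: just make sure a long-running
	 * task goes through __rseq_handle_notify_resume() at least once
	 * per tick, so its mm_cid field is refreshed and the compaction
	 * work gets queued even if the task is never preempted.
	 */
	rseq_preempt(current);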