On Wed, 2025-02-19 at 12:08 -0500, Mathieu Desnoyers wrote:
> On 2025-02-19 11:32, Gabriele Monaco wrote:
> > 
> > On Wed, 2025-02-19 at 10:13 -0500, Mathieu Desnoyers wrote:
> > > > On 2025-02-19 06:31, Gabriele Monaco wrote:
> > > > > > Currently, the task_mm_cid_work function is called in a task
> > > > > > work triggered by a scheduler tick to frequently compact the
> > > > > > mm_cids of each process. This can delay the execution of the
> > > > > > corresponding thread for the entire duration of the function,
> > > > > > negatively affecting the response time of real-time tasks.
> > > > > > In practice, we observe task_mm_cid_work increasing the
> > > > > > latency by 30-35us on a 128-core system; this order of
> > > > > > magnitude is meaningful under PREEMPT_RT.
> > > > > > 
> > > > > > Run the task_mm_cid_work in a new work_struct connected to
> > > > > > the mm_struct rather than in the task context before
> > > > > > returning to userspace.
> > > > > > 
> > > > > > This work_struct is initialised with the mm and disabled
> > > > > > before freeing it. Its execution is no longer triggered by
> > > > > > scheduler ticks: the queuing of the work happens while
> > > > > > returning to userspace in __rseq_handle_notify_resume,
> > > > > > maintaining the checks to avoid running more frequently than
> > > > > > MM_CID_SCAN_DELAY.
> > > > > > 
> > > > > > The main advantage of this change is that the function can
> > > > > > be offloaded to a different CPU and even preempted by RT
> > > > > > tasks.
> > > > > > 
> > > > > > Moreover, this new behaviour is more predictable with
> > > > > > periodic tasks with short runtime, which may rarely run
> > > > > > during a scheduler tick. Now, the work is always scheduled
> > > > > > when the task returns to userspace.
> > > > > > 
> > > > > > The work is disabled during mmdrop: since the function
> > > > > > cannot sleep in all kernel configurations, we cannot wait
> > > > > > for possibly running work items to terminate. We make sure
> > > > > > the mm is valid in case the task is terminating by reserving
> > > > > > it with mmgrab/mmdrop, returning prematurely if we are
> > > > > > really the last user before mmgrab.
> > > > > > This situation is unlikely since we don't schedule the work
> > > > > > for exiting tasks, but we cannot rule it out.
> > > > > > 
> > > > > > Fixes: 223baf9d17f2 ("sched: Fix performance regression introduced by mm_cid")
> > > > > > Signed-off-by: Gabriele Monaco <gmonaco@xxxxxxxxxx>
> > > > > > ---
> > > > > > diff --git a/kernel/rseq.c b/kernel/rseq.c
> > > > > > index 442aba29bc4cf..f8394ebbb6f4d 100644
> > > > > > --- a/kernel/rseq.c
> > > > > > +++ b/kernel/rseq.c
> > > > > > @@ -419,6 +419,7 @@ void __rseq_handle_notify_resume(struct ksignal *ksig, struct pt_regs *regs)
> > > > > >  	}
> > > > > >  	if (unlikely(rseq_update_cpu_node_id(t)))
> > > > > >  		goto error;
> > > > > > +	task_queue_mm_cid(t);
> > > > > >  	return;
> > > > > > 
> > > > > >  error:
> > > > > > diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> > > > > > index 9aecd914ac691..ee35f9962444b 100644
> > > > > > --- a/kernel/sched/core.c
> > > > > > +++ b/kernel/sched/core.c
> > > > > > @@ -5663,7 +5663,6 @@ void sched_tick(void)
> > > > > >  	resched_latency = cpu_resched_latency(rq);
> > > > > >  	calc_global_load_tick(rq);
> > > > > >  	sched_core_tick(rq);
> > > > > > -	task_tick_mm_cid(rq, donor);
> > > > 
> > > > I agree that this approach is promising, however I am concerned
> > > > about the fact that a task running alone on its runqueue (thus
> > > > without preemption) for a long time will never recompact mm_cid,
> > > > and also will never update its mm_cid field.
> > > > 
> > > > So I am tempted to insert this in the sched_tick to cover that
> > > > scenario:
> > > > 
> > > > rseq_preempt(current);
> > > > 
> > > > This would ensure that the task runs __rseq_handle_notify_resume()
> > > > at least each tick.
> > > > 
> > 
> > Right, I thought about this scenario but forgot to add it in the
> > final patch.
> > We could have a test doing that: instead of sleeping, the task busy
> > waits.
> > 
> > Does __rseq_handle_notify_resume need to run in this case, besides
> > the cid compaction, I mean? Otherwise we could again just enqueue
> > the work from there.
> 
> Yes we need to do both:
> 
> - compact cid,
> - run __rseq_handle_notify_resume to update the mm_cid.
> 
> We don't care much if compacting the cid is done at some point and
> __rseq_handle_notify_resume only updates the mm_cid field on the
> following tick.
> 
> So enqueuing the work is not sufficient there, I would really issue
> rseq_preempt(current), which makes sure a busy thread both triggers
> cid compaction *and* gets its mm_cid updated.
> 

Sure, will do.

I've been trying to test this scenario but it's quite hard to achieve: I
set all threads to FIFO at the highest priority, removed all system
calls from the leader thread (even the ones to wait for other threads)
and replaced the usleep with a busy wait; I'm running on a VM, so I'm
not sure if interrupts get in the way. The test still passes...

Anyway, it seems a reasonable situation to happen and I guess it won't
hurt to just run an rseq_preempt in the tick. Testing and sending V8
without touching the selftest. I'm adding a couple of rough sketches of
what I mean below, for reference.

Thanks,
Gabriele
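
For the queuing side described in the changelog, this is roughly what I
have in mind. It is only a sketch: the mm->cid_work field name and the
exact guards here are placeholders, the actual patch may differ.

/*
 * Sketch of task_queue_mm_cid(), called from __rseq_handle_notify_resume()
 * on return to userspace. mm->cid_work is an illustrative name for the
 * work_struct added to mm_struct, not necessarily what the patch uses.
 */
void task_queue_mm_cid(struct task_struct *curr)
{
	struct mm_struct *mm = curr->mm;

	/* Don't queue the scan for kernel threads or exiting tasks. */
	if (!mm || (curr->flags & (PF_EXITING | PF_KTHREAD)))
		return;

	/* Keep the MM_CID_SCAN_DELAY rate limiting of the old tick path. */
	if (time_before(jiffies, READ_ONCE(mm->mm_cid_next_scan)))
		return;

	/* The work item compacts the mm_cids and advances mm_cid_next_scan. */
	schedule_work(&mm->cid_work);
}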
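
And for the tick side in V8, something along these lines (again just a
sketch, not the final hunk; whether it should use current or the donor
task, or be further rate limited, is still open):

/* kernel/sched/core.c, in sched_tick() -- sketch of the V8 idea */
	sched_core_tick(rq);
	/*
	 * No more task_tick_mm_cid() here: just make sure a long-running
	 * task goes through __rseq_handle_notify_resume() at least once
	 * per tick, so its mm_cid field is refreshed and the compaction
	 * work gets queued even if the task is never preempted.
	 */
	rseq_preempt(current);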