On 2025-02-19 11:32, Gabriele Monaco wrote:
On Wed, 2025-02-19 at 10:13 -0500, Mathieu Desnoyers wrote:
On 2025-02-19 06:31, Gabriele Monaco wrote:
Currently, the task_mm_cid_work function is called in a task work
triggered by a scheduler tick to frequently compact the mm_cids of each
process. This can delay the execution of the corresponding thread for the
entire duration of the function, negatively affecting the response time
of real-time tasks. In practice, we observe task_mm_cid_work increasing
the latency by 30-35us on a 128-core system; this order of magnitude is
meaningful under PREEMPT_RT.
Run task_mm_cid_work in a new work_struct connected to the mm_struct
rather than in the task context before returning to userspace.
This work_struct is initialised with the mm and disabled before freeing
it. Its execution is no longer triggered by scheduler ticks: the work is
queued while returning to userspace in __rseq_handle_notify_resume,
keeping the checks that avoid running more often than once every
MM_CID_SCAN_DELAY.
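
For context, the queueing helper called from __rseq_handle_notify_resume
could be implemented along these lines (sketch only; the PF_EXITING test
and the rate-limit check against mm->mm_cid_next_scan are assumptions
carried over from the existing tick-based path):

/* Sketch: queue the mm_cid compaction work on return to userspace. */
void task_queue_mm_cid(struct task_struct *t)
{
	struct mm_struct *mm = t->mm;

	/* No queueing for kernel threads or exiting tasks. */
	if (!mm || (t->flags & PF_EXITING))
		return;
	/* Rate limit: the worker advances mm_cid_next_scan by MM_CID_SCAN_DELAY. */
	if (time_before(jiffies, READ_ONCE(mm->mm_cid_next_scan)))
		return;
	schedule_work(&mm->cid_work);
}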
The main advantage of this change is that the function can be offloaded
to a different CPU and even preempted by RT tasks.
Moreover, this new behaviour is more predictable with periodic tasks with
short runtime, which may rarely run during a scheduler tick. Now, the
work is always scheduled when the task returns to userspace.
The work is disabled during mmdrop: since the function cannot sleep in
all kernel configurations, we cannot wait for possibly running work items
to terminate. We make sure the mm is valid in case the task is
terminating by reserving it with mmgrab/mmdrop, returning prematurely if
we are really the last user right before the mmgrab.
This situation is unlikely since we don't schedule the work for exiting
tasks, but we cannot rule it out.
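
The mm-side lifecycle described above could be wired up roughly as
follows (the helper names are made up for illustration, and using the
non-sleeping disable_work() from the workqueue API is an assumption):

/* Called when the mm_struct is set up, e.g. from mm_init(). */
static inline void mm_init_cid_work(struct mm_struct *mm)
{
	INIT_WORK(&mm->cid_work, task_mm_cid_work);
}

/* Called on the mm free path, which may not sleep: disable, don't flush. */
static inline void mm_destroy_cid_work(struct mm_struct *mm)
{
	disable_work(&mm->cid_work);
}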
Fixes: 223baf9d17f2 ("sched: Fix performance regression introduced by mm_cid")
Signed-off-by: Gabriele Monaco <gmonaco@xxxxxxxxxx>
---
diff --git a/kernel/rseq.c b/kernel/rseq.c
index 442aba29bc4cf..f8394ebbb6f4d 100644
--- a/kernel/rseq.c
+++ b/kernel/rseq.c
@@ -419,6 +419,7 @@ void __rseq_handle_notify_resume(struct ksignal *ksig, struct pt_regs *regs)
}
if (unlikely(rseq_update_cpu_node_id(t)))
goto error;
+ task_queue_mm_cid(t);
return;
error:
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 9aecd914ac691..ee35f9962444b 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -5663,7 +5663,6 @@ void sched_tick(void)
resched_latency = cpu_resched_latency(rq);
calc_global_load_tick(rq);
sched_core_tick(rq);
- task_tick_mm_cid(rq, donor);
I agree that this approach is promising, however I am concerned about
the fact that a task running alone on its runqueue (thus without
preemption) for a long time will never recompact mm_cid, and also
will never update its mm_cid field.
So I am tempted to insert this in the sched_tick to cover that
scenario:
rseq_preempt(current);
This would ensure that the task runs __rseq_handle_notify_resume() at
least once per tick.
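
For illustration, the insertion could land where the removed
task_tick_mm_cid() call used to be (the exact placement is an
assumption):

@@ void sched_tick(void)
 	calc_global_load_tick(rq);
 	sched_core_tick(rq);
+	rseq_preempt(current);
 	scx_tick(rq);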
Right, I thought about this scenario but forgot to add it in the final patch.
We could have a test doing that: instead of sleeping, the task busy waits.
Does __rseq_handle_notify_resume need to run in this case for anything
besides the cid compaction, I mean? Otherwise we could again just enqueue
the work from there.
Yes we need to do both:
- compact cid,
- run __rseq_handle_notify_resume to update the mm_cid.
We don't care much if compacting the cid is done at some point and
__rseq_handle_notify_resume only updates the mm_cid field on the
following tick.
So enqueuing the work is not sufficient there, I would really
issue rseq_preempt(current) which makes sure a busy thread both
triggers cid compaction *and* gets its mm_cid updated.
I'll give it a shot for both.
scx_tick(rq);
rq_unlock(rq, &rf);
@@ -10530,22 +10529,16 @@ static void sched_mm_cid_remote_clear_weight(struct mm_struct *mm, int cpu,
sched_mm_cid_remote_clear(mm, pcpu_cid, cpu);
}
-static void task_mm_cid_work(struct callback_head *work)
+void task_mm_cid_work(struct work_struct *work)
{
unsigned long now = jiffies, old_scan, next_scan;
- struct task_struct *t = current;
struct cpumask *cidmask;
- struct mm_struct *mm;
+ struct mm_struct *mm = container_of(work, struct mm_struct, cid_work);
int weight, cpu;
- SCHED_WARN_ON(t != container_of(work, struct task_struct, cid_work));
-
- work->next = work; /* Prevent double-add */
- if (t->flags & PF_EXITING)
- return;
- mm = t->mm;
- if (!mm)
+ if (!atomic_read(&mm->mm_count))
return;
+ mmgrab(mm);
AFAIU this is racy with respect to re-use of the mm struct.
I recommend that you move mmgrab() to task_queue_mm_cid() just before
invoking schedule_work. That way you ensure that the mm count never
reaches 0 while there is work in flight (and therefore guarantee that
the mm is not re-used).
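
Concretely, building on the earlier hypothetical task_queue_mm_cid()
sketch, that would look something like:

void task_queue_mm_cid(struct task_struct *t)
{
	struct mm_struct *mm = t->mm;

	if (!mm || (t->flags & PF_EXITING))
		return;
	if (time_before(jiffies, READ_ONCE(mm->mm_cid_next_scan)))
		return;
	/* Pin the mm so it cannot be freed or re-used while the work is in flight. */
	mmgrab(mm);
	if (!schedule_work(&mm->cid_work))
		mmdrop(mm);	/* already queued: that instance owns the reference */
}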
Mmh, good point. In that case I think we can still keep testing the
mm_count and return prematurely if it's 1 (we would be the only user,
meaning the task exited before the work got scheduled).
That would be a safe assumption as long as we never reach 0, wouldn't it?
Yes, although don't forget the mmdrop in that case ;)
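
I.e., with the reference taken at queue time, the handler prologue would
become something like this (sketch of the idea discussed above, untested):

void task_mm_cid_work(struct work_struct *work)
{
	struct mm_struct *mm = container_of(work, struct mm_struct, cid_work);

	/*
	 * The reference taken in task_queue_mm_cid() keeps mm_count from
	 * reaching 0. If it is exactly 1, we are the last user: the task
	 * exited before the work ran, so there is nothing left to compact.
	 */
	if (atomic_read(&mm->mm_count) == 1) {
		mmdrop(mm);
		return;
	}

	/* ... cid compaction as in the hunk above ... */

	mmdrop(mm);
}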
Thanks,
Mathieu
Thanks,
Gabriele
--
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com