On Mon, 2025-02-17 at 14:46 -0500, Mathieu Desnoyers wrote:
> On 2025-02-17 06:23, Gabriele Monaco wrote:
> > Currently, the task_mm_cid_work function is called in a task work
> > triggered by a scheduler tick to frequently compact the mm_cids of
> > each process for each core. This can delay the execution of the
> > corresponding thread for the entire duration of the function,
> > negatively affecting the response in case of real time tasks. In
> > practice, we observe task_mm_cid_work increasing the latency of
> > 30-35us on a 128 cores system, this order of magnitude is
> > meaningful under PREEMPT_RT.
> > 
> > Run the task_mm_cid_work in batches of up to
> > CONFIG_RSEQ_CID_SCAN_BATCH cpus, this contains the duration of the
> > delay for each scan.
> > Also improve the duration by iterating for all present cpus and
> > not for all possible.
> 
> Iterating only on present cpus is not enough on CONFIG_HOTPLUG=y,
> because ACPI can dynamically add/remove CPUs from the set. If we end
> up iterating only on present cpus, then we need to add a cpu hotplug
> callback to handle the removal case, and I'm not sure the added
> complexity is worth it here.
> 

Got it, I didn't think of that.

> > 
> > The task_mm_cid_work already contains a mechanism to avoid running
> > more frequently than every 100ms, considering the function runs at
> > every tick, assuming ticks every 1ms (HZ=1000 is common on
> > distros) and assuming an unfavorable scenario of 1/10 ticks during
> > task T runtime, we can compact the CIDs for task T in about 130ms
> > by setting CONFIG_RSEQ_CID_SCAN_BATCH to 10 on a 128 cores machine.
> > This value also drastically reduces the task work duration and is
> > a more acceptable latency for the aforementioned machine.
> > 
> > Fixes: 223baf9d17f2 ("sched: Fix performance regression introduced
> > by mm_cid")
> > Signed-off-by: Gabriele Monaco <gmonaco@xxxxxxxxxx>
> > ---
> >  include/linux/mm_types.h |  8 ++++++++
> >  init/Kconfig             | 12 ++++++++++++
> >  kernel/sched/core.c      | 27 ++++++++++++++++++++++++---
> >  3 files changed, 44 insertions(+), 3 deletions(-)
> > 
> > @@ -10546,6 +10546,15 @@ static void task_mm_cid_work(struct callback_head *work)
> >  	mm = t->mm;
> >  	if (!mm)
> >  		return;
> > +	cpu = from_cpu = READ_ONCE(mm->mm_cid_scan_cpu);
> > +	to_cpu = from_cpu + CONFIG_RSEQ_CID_SCAN_BATCH;
> > +	if (from_cpu > cpumask_last(cpu_present_mask)) {
> 
> See explanation about using possible rather than present.
> 
> > +		from_cpu = 0;
> > +		to_cpu = CONFIG_RSEQ_CID_SCAN_BATCH;
> 
> If the cpu_possible_mask is sparsely populated, this will end
> up doing batches that hit very few cpus. Instead, we should
> count how many cpus are handled within each
> for_each_cpu_from(cpu, cpu_possible_mask) loops below and break
> when reaching CONFIG_RSEQ_CID_SCAN_BATCH.
> 
> > +	}
> 
> [...]
> 
> > +	for_each_cpu_from(cpu, cpu_present_mask) {
> > +		if (cpu == to_cpu)
> > +			break;
> >  		sched_mm_cid_remote_clear_weight(mm, cpu, weight);
> > +	}
> 
> Here set mm->mm_cid_scan_cpu to the new next position which is
> the result from the "for each" loop.
> 

Mmh, good point. I wonder, though, whether we need to care about
multiple threads scanning the same mm concurrently. In my patch it
shouldn't happen (threads /book/ the range up to to_cpu by writing it
before scanning). To keep that property with the counting approach,
I'd probably need to build a mask of N cpus starting from from_cpu
and iterate over that, or do a dry loop before actually scanning.

Thanks,
Gabriele
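
P.S. Just to make sure I understand the counting idea, here is a
rough, untested sketch (the concurrency question above is ignored,
nr_scanned is a new local I'm introducing here, and mm/weight are
assumed to be the locals already computed earlier in
task_mm_cid_work):

	int cpu, from_cpu, nr_scanned = 0;

	from_cpu = READ_ONCE(mm->mm_cid_scan_cpu);
	/* Wrap around once we ran past the last possible cpu. */
	if (from_cpu > cpumask_last(cpu_possible_mask))
		from_cpu = 0;
	cpu = from_cpu;
	for_each_cpu_from(cpu, cpu_possible_mask) {
		sched_mm_cid_remote_clear_weight(mm, cpu, weight);
		/* Stop after handling a full batch of set cpus. */
		if (++nr_scanned == CONFIG_RSEQ_CID_SCAN_BATCH)
			break;
	}
	/* Next scan resumes right after the last cpu handled. */
	WRITE_ONCE(mm->mm_cid_scan_cpu, cpu + 1);

With a sparse cpu_possible_mask this always clears a full batch of
existing cpus, and the stored position naturally wraps on the
following invocation once the mask is exhausted.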