On 2/20/25 16:25, Tobias Huschle wrote:
On 18/02/2025 06:58, Shrikanth Hegde wrote:
[...]
There are a couple of issues and corner cases which need further
consideration:
- rt & dl: Realtime and deadline scheduling require some additional
attention.
I think we need to address at least rt; there would be some non-per-CPU
kworker threads which need to move out of parked CPUs.
Yeah, sounds reasonable. It would probably make sense to tackle that one next.
Ok. I was experimenting with the rt code. It's all quite new to me.
I was able to get non-bound RT tasks to honor the CPU parked state. However, it works only
if the RT task performs some wakeups (for example, hackbench started with chrt -r 10).
If the task is continuously running (for example stress-ng with chrt -r 10), then it doesn't pack at runtime when
CPUs become parked after it has started running. Not sure how many RT tasks behave that way.
It does pack when starting afresh while CPUs are already parked, and unpacks when CPUs become unparked, though.
I added some prints in the rt code to understand the behaviour. A few observations:
1. balance_rt() and pull_rt_task() don't get called once stress-ng starts running,
which means there is no opportunity to pull the tasks or load balance.
They do get called when the migration task is running, but that can't be balanced.
Is there a way to trigger load balancing of RT tasks when the task doesn't give up the CPU?
(One rough idea for this is sketched just before the patch below.)
2. The regular load balance (sched_balance_rq()) does get called even when the CPU is only
running RT tasks. It tries to do the load balance (i.e., it goes through update_sd_lb_stats() etc.),
but it will not do an actual balance because it only works on src_rq->cfs_tasks.
That may be an opportunity to skip the load balance if the CPU is running an RT task,
i.e., the CPU is not idle and is chosen as the CPU to do the load balancing because it is the first CPU
in the group, yet it is running only RT tasks.
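A rough, untested sketch of what I mean for point 2 (the helper is hypothetical and not part of
the patch below; field names are from my reading of mainline and may need adjusting):

static inline bool rq_busy_with_rt_only(struct rq *rq)
{
	/* busy, and everything runnable on this rq belongs to the RT class */
	return rq->nr_running && rq->nr_running == rq->rt.rt_nr_running;
}

The idea would then be to bail out early in sched_balance_rq() (or in should_we_balance())
when idle != CPU_IDLE and rq_busy_with_rt_only(this_rq) is true, since in that scenario
nothing would get moved anyway.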
Can point 1 be addressed? And does point 2 make sense?
Also, please suggest a better way than the patch below if there is one.
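Coming back to point 1, one rough, untested idea (again, not part of the patch below): since
pull_rt_task()/push_rt_task() only act on queued-but-not-running tasks, a continuously running
RT task would presumably have to be moved via the stopper, similar to how push_rt_task()
already handles a running or migration-disabled task through push_cpu_stop(). Something along
these lines, called with the rq lock held, with arch_cpu_parked() being the hook from this series:

static void evict_running_rt_from_parked_cpu(struct rq *rq)
{
	struct task_struct *p = rq->curr;

	if (!arch_cpu_parked(cpu_of(rq)) || !rt_task(p))
		return;

	if (p->nr_cpus_allowed == 1 || rq->push_busy)
		return;

	get_task_struct(p);
	rq->push_busy = true;

	/* same pattern as the existing push_cpu_stop() users in rt.c/deadline.c */
	preempt_disable();
	raw_spin_rq_unlock(rq);
	stop_one_cpu_nowait(cpu_of(rq), push_cpu_stop, p, &rq->push_work);
	preempt_enable();

	raw_spin_rq_lock(rq);
}

push_cpu_stop() would then re-select a target through find_lock_rq()/find_lowest_rq(). I have no
idea yet where this would best be hooked (the tick, or the parking transition itself), so take it
only as an illustration of the mechanism.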
diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
index 4b8e33c615b1..4da2e60da9a8 100644
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -462,6 +462,9 @@ static inline bool rt_task_fits_capacity(struct task_struct *p, int cpu)
unsigned int max_cap;
unsigned int cpu_cap;
+ if (arch_cpu_parked(cpu))
+ return false;
+
/* Only heterogeneous systems can benefit from this check */
if (!sched_asym_cpucap_active())
return true;
@@ -476,6 +479,9 @@ static inline bool rt_task_fits_capacity(struct task_struct *p, int cpu)
#else
static inline bool rt_task_fits_capacity(struct task_struct *p, int cpu)
{
+ if (arch_cpu_parked(cpu))
+ return false;
+
return true;
}
#endif
@@ -1801,6 +1807,8 @@ static int find_lowest_rq(struct task_struct *task)
int this_cpu = smp_processor_id();
int cpu = task_cpu(task);
int ret;
+ int parked_cpu = -1;
+ int tmp_cpu;
/* Make sure the mask is initialized first */
if (unlikely(!lowest_mask))
@@ -1809,11 +1817,18 @@ static int find_lowest_rq(struct task_struct *task)
if (task->nr_cpus_allowed == 1)
return -1; /* No other targets possible */
+ for_each_cpu(tmp_cpu, cpu_online_mask) {
+ if (arch_cpu_parked(tmp_cpu)) {
+ parked_cpu = tmp_cpu;
+ break;
+ }
+ }
+
/*
* If we're on asym system ensure we consider the different capacities
* of the CPUs when searching for the lowest_mask.
*/
- if (sched_asym_cpucap_active()) {
+ if (sched_asym_cpucap_active() || parked_cpu > -1) {
ret = cpupri_find_fitness(&task_rq(task)->rd->cpupri,
task, lowest_mask,
@@ -1835,14 +1850,14 @@ static int find_lowest_rq(struct task_struct *task)
* We prioritize the last CPU that the task executed on since
* it is most likely cache-hot in that location.
*/
- if (cpumask_test_cpu(cpu, lowest_mask))
+ if (cpumask_test_cpu(cpu, lowest_mask) && !arch_cpu_parked(cpu))
return cpu;
/*
* Otherwise, we consult the sched_domains span maps to figure
* out which CPU is logically closest to our hot cache data.
*/
- if (!cpumask_test_cpu(this_cpu, lowest_mask))
+ if (!cpumask_test_cpu(this_cpu, lowest_mask) || arch_cpu_parked(this_cpu))
this_cpu = -1; /* Skip this_cpu opt if not among lowest */
rcu_read_lock();
@@ -1862,7 +1877,7 @@ static int find_lowest_rq(struct task_struct *task)
best_cpu = cpumask_any_and_distribute(lowest_mask,
sched_domain_span(sd));
- if (best_cpu < nr_cpu_ids) {
+ if (best_cpu < nr_cpu_ids && !arch_cpu_parked(best_cpu)) {
rcu_read_unlock();
return best_cpu;
}
@@ -1879,7 +1894,7 @@ static int find_lowest_rq(struct task_struct *task)
return this_cpu;
cpu = cpumask_any_distribute(lowest_mask);
- if (cpu < nr_cpu_ids)
+ if (cpu < nr_cpu_ids && !arch_cpu_parked(cpu))
return cpu;
return -1;
Meanwhile, I will continue looking at the code to understand it better.
- ext: Probably affected as well. Needs some conceptual thought first.
- raciness: Right now, there are no synchronization efforts. It needs to be
  considered whether those might be necessary or if it is alright that the
  parked state of a CPU might change during load balancing.
Patches apply to tip:sched/core
The s390 patch serves as a simplified implementation example.
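For reference, this is roughly what I assume the generic fallback behind the hook looks like
(hypothetical sketch based on the diffstat below; the exact signature and placement in
include/linux/sched/topology.h may differ), with the s390 patch providing the actual selection
logic via an override:

#ifndef arch_cpu_parked
/* Default: an architecture that does not override the hook never parks CPUs */
static inline bool arch_cpu_parked(int cpu)
{
	return false;
}
#endif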
Gave it a try on powerpc with the debugfs file. It works for SCHED_NORMAL tasks.
That's great to hear!
Tobias Huschle (3):
sched/fair: introduce new scheduler group type group_parked
sched/fair: adapt scheduler group weight and capacity for parked CPUs
s390/topology: Add initial implementation for selection of parked
CPUs
arch/s390/include/asm/smp.h | 2 +
arch/s390/kernel/smp.c | 5 ++
include/linux/sched/topology.h | 19 ++++++
kernel/sched/core.c | 13 ++++-
kernel/sched/fair.c | 104 ++++++++++++++++++++++++++++-----
kernel/sched/syscalls.c | 3 +
6 files changed, 130 insertions(+), 16 deletions(-)