Re: [PATCH] sched/topology: Remove EM_MAX_COMPLEXITY limit

Dietmar Eggemann <dietmar.eggemann@xxxxxxx> · Thu, 18 Aug 2022 14:19:47 +0200

On 12/08/2022 12:16, Pierre Gondois wrote:
> From: Pierre Gondois <Pierre.Gondois@xxxxxxx>

[...]

> find_energy_efficient_cpu() (feec) is now doing:
> feec()
> \_ for_each_pd(pd) [0]
>   // get max_spare_cap_cpu and compute_prev_delta
>   \_ for_each_cpu(pd) [1]
> 
>   \_ get_pd_busy_time(pd) [2]
>     \_ for_each_cpu(pd)
> 
>   // evaluate pd energy without the task
>   \_ get_pd_max_util(pd, -1) [3.0]
>     \_ for_each_cpu(pd)
>   \_ compute_energy(pd, -1)
>     \_ for_each_ps(pd)
> 
>   // evaluate pd energy with the task on prev_cpu
>   \_ get_pd_max_util(pd, prev_cpu) [3.1]
>     \_ for_each_cpu(pd)
>   \_ compute_energy(pd, prev_cpu)
>     \_ for_each_ps(pd)
> 
>   // evaluate pd energy with the task on max_spare_cap_cpu
>   \_ get_pd_max_util(pd, max_spare_cap_cpu) [3.2]
>     \_ for_each_cpu(pd)
>   \_ compute_energy(pd, max_spare_cap_cpu)
>     \_ for_each_ps(pd)
> 
> [3.1] happens only once since prev_cpu is unique. To have an upper
> bound of the complexity, [3.1] is taken into account for all pds.
> So with the same definitions for nr_pd, nr_cpus and nr_ps,
> the complexity is of:
> nr_pd * (2 * [nr_cpus in pd] + 3 * ([nr_cpus in pd] + [nr_ps in pd]))
>  [0]  * (     [1] + [2]      +       [3.0] + [3.1] + [3.2]          )
> = 5 * nr_cpus + 3 * nr_ps
> 
> The complexity limit was set to 2048 in:
> commit b68a4c0dba3b1 ("sched/topology: Disable EAS on inappropriate
> platforms")
> to make "EAS usable up to 16 CPUs with per-CPU DVFS and less than 8
> performance states each". For the same platform, the complexity would
> actually be of:
> 5 * 16 + 3 * 7 = 101

This is somewhat hard to grasp.

Example: 16 CPUs w/ per-CPU DVFS and < 8 performance states (OPPs) each

C  : Complexity

Nc : #CPUs in system
Ns : Sum of PSs (Performance States) over all PDs
Nd : #PDs

Nc' : #CPUs in PD
Ns' : #PSs in PD

(1) Currently we have:

    C = Nd * (Nc + Ns)

    Nc = 16, Nd = 16, Ns = 16 * 7

    C = 16 * (16 + 16 * 7)

      = 2048

(2) Your new formula is:

    Nc' = 1, Ns' = 7

    C = Nd * (2 * Nc' + 3 * (Nc' + Ns'))

      = Nd * (5 * Nc' + 3 * Ns')

      = 16 * (5 * 1 + 3 * 7)

      = 416

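      = 5 * Nc + 3 * Ns        (Nc = 16, Ns = 16 * 7 = 112)

So the result for this platform is 416, not 101: the commit message
plugs the per-PD Ns' = 7 into a formula that takes the system-wide
Ns = 112.

I would update the example and keep the limit at roughly C = 2048.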
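
To make the comparison easy to replay, here is a minimal sketch of the two
formulas as plain C helpers. The function names are hypothetical (they are
not kernel functions), and the second helper assumes all PDs are identical,
i.e. the same Nc' and Ns' in every PD, as in the example above.

```c
/*
 * Hypothetical helpers for illustration only, not kernel code.
 *
 * (1) Current check: C = Nd * (Nc + Ns)
 *     nd: number of PDs, nc: CPUs in the system,
 *     ns: sum of performance states over all PDs.
 */
static unsigned long em_complexity_old(unsigned long nd,
				       unsigned long nc,
				       unsigned long ns)
{
	return nd * (nc + ns);
}

/*
 * (2) Upper bound from the patch: C = Nd * (5 * Nc' + 3 * Ns')
 *     nc_pd: CPUs per PD, ns_pd: performance states per PD.
 *     Assumes every PD has the same nc_pd and ns_pd.
 */
static unsigned long em_complexity_new(unsigned long nd,
				       unsigned long nc_pd,
				       unsigned long ns_pd)
{
	return nd * (5 * nc_pd + 3 * ns_pd);
}
```

For the 16-CPU platform with per-CPU DVFS and 7 OPPs each,
em_complexity_old(16, 16, 16 * 7) evaluates to 2048 and
em_complexity_new(16, 1, 7) evaluates to 416.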

> Since the EAS complexity was greatly reduced, bigger platforms can
> handle EAS. For instance, a platform with 256 CPUs with 256
> performance states each would reach it. To reflect this improvement,
> remove the EAS complexity check.
> 
> Signed-off-by: Pierre Gondois <Pierre.Gondois@xxxxxxx>

We should definitely align the EM complexity check and documentation
with feec()'s current implementation. I would suggest that we keep both
in place and update them rather than remove them.

> ---
>  Documentation/scheduler/sched-energy.rst | 37 ++--------------------
>  kernel/sched/topology.c                  | 39 ++----------------------
>  2 files changed, 6 insertions(+), 70 deletions(-)
> 
> diff --git a/Documentation/scheduler/sched-energy.rst b/Documentation/scheduler/sched-energy.rst
> index 8fbce5e767d9..3d1d71134d16 100644
> --- a/Documentation/scheduler/sched-energy.rst
> +++ b/Documentation/scheduler/sched-energy.rst
> @@ -356,38 +356,7 @@ placement. For EAS it doesn't matter whether the EM power values are expressed
>  in milli-Watts or in an 'abstract scale'.
>  
>  
> -6.3 - Energy Model complexity
> -^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> -
> -The task wake-up path is very latency-sensitive. When the EM of a platform is
> -too complex (too many CPUs, too many performance domains, too many performance
> -states, ...), the cost of using it in the wake-up path can become prohibitive.
> -The energy-aware wake-up algorithm has a complexity of:
> -
> -	C = Nd * (Nc + Ns)
> -
> -with: Nd the number of performance domains; Nc the number of CPUs; and Ns the
> -total number of OPPs (ex: for two perf. domains with 4 OPPs each, Ns = 8).
> -
> -A complexity check is performed at the root domain level, when scheduling
> -domains are built. EAS will not start on a root domain if its C happens to be
> -higher than the completely arbitrary EM_MAX_COMPLEXITY threshold (2048 at the
> -time of writing).
> -
> -If you really want to use EAS but the complexity of your platform's Energy
> -Model is too high to be used with a single root domain, you're left with only
> -two possible options:
> -
> -    1. split your system into separate, smaller, root domains using exclusive
> -       cpusets and enable EAS locally on each of them. This option has the
> -       benefit to work out of the box but the drawback of preventing load
> -       balance between root domains, which can result in an unbalanced system
> -       overall;
> -    2. submit patches to reduce the complexity of the EAS wake-up algorithm,
> -       hence enabling it to cope with larger EMs in reasonable time.
> -
> -

I see value in this paragraph. Obviously it has to match the actual
feec() implementation.

[...]