On Wed, 10 Jan 2024 at 19:10, Dietmar Eggemann <dietmar.eggemann@xxxxxxx> wrote:
>
> On 09/01/2024 14:29, Vincent Guittot wrote:
> > On Tue, 9 Jan 2024 at 12:34, Dietmar Eggemann <dietmar.eggemann@xxxxxxx> wrote:
> >>
> >> On 08/01/2024 14:48, Vincent Guittot wrote:
> >>> Following the consolidation and cleanup of CPU capacity in [1], this
> >>> series reworks how the scheduler gets the pressures on CPUs. We need
> >>> to take into account all pressures applied by cpufreq on the compute
> >>> capacity of a CPU for dozens of ms or more, and not only the cpufreq
> >>> cooling device or HW mitigations. We split the pressure applied to a
> >>> CPU's capacity into two parts:
> >>> - one from cpufreq and freq_qos
> >>> - one from HW high-frequency mitigation.
> >>>
> >>> The next step will be to add a dedicated interface for long-standing
> >>> capping of the CPU capacity (i.e. for seconds or more) like the
> >>> scaling_max_freq of cpufreq sysfs. The latter is already taken into
> >>> account by this series, but as a temporary pressure, which is not
> >>> always the best choice when we know that it will last for seconds or
> >>> more.
> >>
> >> I guess this is related to the 'user space system pressure' (*) slide
> >> of your OSPM '23 talk.
> >
> > yes
> >
> >>
> >> Where do you draw the line, time-wise, between (*) and the 'medium
> >> pace system pressure' (e.g. thermal and FREQ_QOS)?
> >
> > My goal is to consider the /sys/../scaling_max_freq as the 'user space
> > system pressure'.
> >
> >>
> >> IIRC, with (*) you want to rebuild the sched domains etc.
> >
> > The easiest way would be to rebuild the sched_domain, but the cost is
> > not small, so I would prefer to skip the rebuild and add a new signal
> > that keeps track of this capped capacity.
>
> Are you saying that you don't need to rebuild the sched domains since
> the cpu_capacity information of the sched domain hierarchy is
> independently updated via:
>
> update_sd_lb_stats() {
>
>     update_group_capacity() {
>
>         if (!child)
>             update_cpu_capacity(sd, cpu) {
>
>                 capacity = scale_rt_capacity(cpu) {
>
>                     max = get_actual_cpu_capacity(cpu) <- (*)
>                 }
>
>                 sdg->sgc->capacity = capacity;
>                 sdg->sgc->min_capacity = capacity;
>                 sdg->sgc->max_capacity = capacity;
>             }
>
>     }
>
> }
>
> (*) influence of temporary and permanent (to be added) frequency
> pressure on cpu_capacity (per-cpu and in sd data)

I'm more concerned by rd->max_cpu_capacity, which remains at the
original capacity and triggers spurious load balancing if we take the
userspace max freq into account instead of the original max compute
capacity of a CPU (a sketch of where this field is set follows the
trace below). And there is also the question of how to manage this for
RT and DL.

>
> example: hackbench on h960 with IPA:
>                                                                                   cap  min  max
> ...
> hackbench-2284 [007] .Ns..  2170.796726: update_group_capacity: sdg !child cpu=7 1017 1017 1017
> hackbench-2456 [007] ..s..  2170.920729: update_group_capacity: sdg !child cpu=7 1018 1018 1018
>     <...>-2314 [007] ..s1.  2171.044724: update_group_capacity: sdg !child cpu=7 1011 1011 1011
> hackbench-2541 [007] ..s..  2171.168734: update_group_capacity: sdg !child cpu=7  918  918  918
> hackbench-2558 [007] .Ns..  2171.228716: update_group_capacity: sdg !child cpu=7  912  912  912
>     <...>-2321 [007] ..s..  2171.352718: update_group_capacity: sdg !child cpu=7  812  812  812
> hackbench-2553 [007] ..s..  2171.476721: update_group_capacity: sdg !child cpu=7  640  640  640
>     <...>-2446 [007] ..s2.  2171.600743: update_group_capacity: sdg !child cpu=7  610  610  610
> hackbench-2347 [007] ..s..  2171.724738: update_group_capacity: sdg !child cpu=7  406  406  406
> hackbench-2331 [007] .Ns1.  2171.848768: update_group_capacity: sdg !child cpu=7  390  390  390
> hackbench-2421 [007] ..s..  2171.972733: update_group_capacity: sdg !child cpu=7  388  388  388
> ...
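
For cross-reference, the two pressure parts from the cover letter feed
the 'max' used by scale_rt_capacity() in the call chain above. A minimal
sketch of such a helper, assuming the cpufreq_get_pressure() and
hw_load_avg() names from this series (reproduced from memory, not
verbatim from the patches):

    /* kernel/sched/fair.c -- sketch, not the verbatim patch */
    static unsigned long get_actual_cpu_capacity(int cpu)
    {
            unsigned long capacity = arch_scale_cpu_capacity(cpu);

            /*
             * Remove the larger of the two pressures: the slow-paced
             * cpufreq/freq_qos capping and the PELT-averaged HW
             * high-frequency mitigation.
             */
            capacity -= max(hw_load_avg(cpu_rq(cpu)),
                            cpufreq_get_pressure(cpu));

            return capacity;
    }

With this, sdg->sgc->{capacity,min_capacity,max_capacity} follow both
pressures on the next load-balance pass, without any sched domain
rebuild.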
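
The rd->max_cpu_capacity concern is that the root domain value is only
written when the sched domains are built, from the original capacity,
so the update path above never refreshes it. Roughly, from memory of
the attach phase of build_sched_domains() in kernel/sched/topology.c
(details may differ):

    /* build_sched_domains() -- attach phase, sketch from memory */
    rcu_read_lock();
    for_each_cpu(i, cpu_map) {
            unsigned long capacity = arch_scale_cpu_capacity(i);

            /*
             * rd->max_cpu_capacity is only (re)computed here, from the
             * original capacity, never from the pressured capacity used
             * by update_cpu_capacity().
             */
            if (capacity > READ_ONCE(d.rd->max_cpu_capacity))
                    WRITE_ONCE(d.rd->max_cpu_capacity, capacity);

            cpu_attach_domain(*per_cpu_ptr(d.sd, i), d.rd, i);
    }
    rcu_read_unlock();

So a long-standing userspace cap via scaling_max_freq stays invisible
to rd->max_cpu_capacity unless the domains are rebuilt or a new signal
tracks the capped capacity, which is the trade-off discussed above.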