Here is a follow-up with some fairness fixes related to throttling, PELT
and load decay in general.

It is related to the discussion in:

https://lore.kernel.org/lkml/20210425080902.11854-1-odin@xxxxxxx
and
https://lkml.kernel.org/r/20210501141950.23622-2-odin@xxxxxxx

Tested on v5.13-rc2 (since that contains the fix from the discussion
above).

The patch descriptions should make sense on their own, and I have
attached some simple reproduction scripts at the end of this mail. I
also appended a patch fixing some ascii art that I have looked at several
times without understanding it; it turns out it breaks if tabs are not
8 spaces wide. I can submit that as a separate patch if necessary.

Also, I have no idea what to call the "insert_on_unthrottle" var, so
feel free to come up with suggestions.

There are probably "better" and more reliable ways to reproduce this,
but these work for me "most of the time", and give OK context, IMO.
Throttling is not deterministic, so keep that in mind. I have been
testing with CONFIG_HZ=250, so if you use =1000 (or anything else), you
might get other results, or find it harder to reproduce.

Reprod script for "Add tg_load_contrib cfs_rq decay checking":

--- bash start
CGROUP=/sys/fs/cgroup/slice

function run_sandbox {
  local CG="$1"
  local LCPU="$2"
  local SHARES="$3"
  local CMD="$4"

  local PIPE="$(mktemp -u)"
  mkfifo "$PIPE"
  sh -c "read < $PIPE ; exec $CMD" &
  local TASK="$!"
  mkdir -p "$CG/sub"
  tee "$CG"/cgroup.subtree_control <<< "+cpuset +cpu"
  tee "$CG"/sub/cgroup.procs <<< "$TASK"
  tee "$CG"/sub/cpuset.cpus <<< "$LCPU"
  tee "$CG"/sub/cpu.weight <<< "$SHARES"
  tee "$CG"/cpu.max <<< "10000 100000"

  sleep .1
  tee "$PIPE" <<< sandox_done
  rm "$PIPE"
}

mkdir -p "$CGROUP"
tee "$CGROUP"/cgroup.subtree_control <<< "+cpuset +cpu"

run_sandbox "$CGROUP/cg-1" "0" 100 "stress --cpu 1"
run_sandbox "$CGROUP/cg-2" "3" 100 "stress --cpu 1"
sleep 1.02
tee "$CGROUP"/cg-1/sub/cpuset.cpus <<< "1"
sleep 1.05
tee "$CGROUP"/cg-1/sub/cpuset.cpus <<< "2"
sleep 1.07
tee "$CGROUP"/cg-1/sub/cpuset.cpus <<< "3"

sleep 2

tee "$CGROUP"/cg-1/cpu.max <<< "max"
tee "$CGROUP"/cg-2/cpu.max <<< "max"

read
killall stress
sleep .2
rmdir /sys/fs/cgroup/slice/{cg-{1,2}{/sub,},}

# Often gives:
# cat /sys/kernel/debug/sched/debug | grep ":/slice" -A 28 | egrep "(:/slice)|tg_load_avg"
#
# cfs_rq[3]:/slice/cg-2/sub
#   .tg_load_avg_contrib           : 1024
#   .tg_load_avg                   : 1024
# cfs_rq[3]:/slice/cg-1/sub
#   .tg_load_avg_contrib           : 1023
#   .tg_load_avg                   : 1023
# cfs_rq[3]:/slice/cg-1
#   .tg_load_avg_contrib           : 1040
#   .tg_load_avg                   : 2062
# cfs_rq[3]:/slice/cg-2
#   .tg_load_avg_contrib           : 1013
#   .tg_load_avg                   : 1013
# cfs_rq[3]:/slice
#   .tg_load_avg_contrib           : 1540
#   .tg_load_avg                   : 1540
--- bash end

Reprod for "sched/fair: Correctly insert cfs_rqs to list on unthrottle":

--- bash start
CGROUP=/sys/fs/cgroup/slice
TMP_CG=/sys/fs/cgroup/tmp
OLD_CG=/sys/fs/cgroup"$(cat /proc/self/cgroup | cut -c4-)"

function run_sandbox {
  local CG="$1"
  local LCPU="$2"
  local SHARES="$3"
  local CMD="$4"

  local PIPE="$(mktemp -u)"
  mkfifo "$PIPE"
  sh -c "read < $PIPE ; exec $CMD" &
  local TASK="$!"
  mkdir -p "$CG/sub"
  tee "$CG"/cgroup.subtree_control <<< "+cpuset +cpu"
  tee "$CG"/sub/cpuset.cpus <<< "$LCPU"
  tee "$CG"/sub/cgroup.procs <<< "$TASK"
  tee "$CG"/sub/cpu.weight <<< "$SHARES"

  sleep .01
  tee "$PIPE" <<< sandox_done
  rm "$PIPE"
}

mkdir -p "$CGROUP"
mkdir -p "$TMP_CG"
tee "$CGROUP"/cgroup.subtree_control <<< "+cpuset +cpu"

echo $$ | tee "$TMP_CG"/cgroup.procs
tee "$TMP_CG"/cpuset.cpus <<< "0"
sleep .1

tee "$CGROUP"/cpu.max <<< "1000 4000"

run_sandbox "$CGROUP/cg-0" "0" 10000 "stress --cpu 1"
run_sandbox "$CGROUP/cg-3" "3" 1 "stress --cpu 1"

sleep 2
tee "$CGROUP"/cg-0/sub/cpuset.cpus <<< "3"

tee "$CGROUP"/cpu.max <<< "max"

read
killall stress
sleep .2
echo $$ | tee "$OLD_CG"/cgroup.procs
rmdir "$TMP_CG" /sys/fs/cgroup/slice/{cg-{0,3}{/sub,},}

# Often gives:
# cat /sys/kernel/debug/sched/debug | grep ":/slice" -A 28 | egrep "(:/slice)|tg_load_avg"
#
# cfs_rq[3]:/slice/cg-3/sub
#   .tg_load_avg_contrib           : 1039
#   .tg_load_avg                   : 2036
# cfs_rq[3]:/slice/cg-0/sub
#   .tg_load_avg_contrib           : 1023
#   .tg_load_avg                   : 1023
# cfs_rq[3]:/slice/cg-0
#   .tg_load_avg_contrib           : 102225
#   .tg_load_avg                   : 102225
# cfs_rq[3]:/slice/cg-3
#   .tg_load_avg_contrib           : 4
#   .tg_load_avg                   : 1001
# cfs_rq[3]:/slice
#   .tg_load_avg_contrib           : 1038
#   .tg_load_avg                   : 1038
--- bash end

Thanks
Odin

Odin Ugedal (3):
  sched/fair: Add tg_load_contrib cfs_rq decay checking
  sched/fair: Correctly insert cfs_rq's to list on unthrottle
  sched/fair: Fix ascii art by relpacing tabs

 kernel/sched/fair.c  | 22 +++++++++++++---------
 kernel/sched/sched.h |  1 +
 2 files changed, 14 insertions(+), 9 deletions(-)

--
2.31.1