On Sun, 25 Apr 2021 at 10:11, Odin Ugedal <odin@xxxxxxx> wrote: > > TL;DR: Read the patch msg first; hopefully it makes sense. Feel free > to cc more relevant people, and (as I am pretty new to this) feedback > is highly appreciated! > > I thought it was smart to split the discussion of the patch and the issuse, so > here is some (wall of text, sorry) discussion around it: > > This patch fixes an extremely strange edge case, discovered more or less by > accident, where the fairness of the cfs scheduler get skewed quite > extensively. This often leads to two cgroups who are supposed to get > share cpu time 50/50, instead getting something like 90/10 or > even 99/1 (!!) in some rare cases (see below for example). I find this > edge case quite interesting, especially since I am able to so easily > reproduce it on all machines and all (4.8+) kernels. > > The issue can be easily reproduced on all kernels 4.8+, and afaik. most > "container runtimes" are affected. I do believe that I have wondered about > this issue earlier, but not actually understood what was causing it. There > is really (afaik.) no way to find the issue from userspace without using > a tracing tool like bpftrace. I am _sure_ this affects real production > environments _everywhere_, at least those running the normal container/using > cgroup tools together with cpu pinning; but verifying that the > fairness is skewed is quite hard, or more or less impossible in such cases, > especially when running multiple workloads simultaneously. I did inspect some > of my production servers, and found a set of this on multiple cgroups. > > It _is_ possible to read '/proc/sched_debug', but since that list only prints > the cfs_rq's from "leaf_cfs_rq_list", the cfs_rq's we are _really_ looking for, > are missing; making it look like the problem isn't becuase of cfs. > > I think the current patch is the best way solve the issue, but I will happily > discuss other possible solutions. One could also just "don't" propagate the > load to the task_group, but I think the current implementation on that part > is the correct way to do it. This is at least my best solution to avoid adding > logic to a "hotter" path that also would require more logic. > > The only thing I am thinking a bit about is if the cfs_rq (or one of its ancestors) is > throttled. In case the cfs_rq->throttled==1, I don't think we currently can skip, since > it only adds to the leaf_list in case (cfs_rq->nr_running >= 1), and the same applies for > (cfs_rq->throttle_count >= 1), since we cannot guarantee that it will be added again. Will > this code however interfere with the throttling mechanism? I have tested inserting new procs > to throttled cgroups (or children cgroups of throttled ones) with the patch applied, > and it works as I expect it to do (but other people might have more insight). > > Also, another solution may be to don't add the load before the task is actually > enqueued, avoiding this issue all together. But is that better (I assume that will > be more in the "hot path", and require quite a bit more code)? > > There may definetly be a better solution to this that I still haven't found, and that > other more experienced kernel devs see right away. :) > > > Also feel free to suggest additions/wording of the patch > message and title, and/or the comment in the code to make it more clear. > > This issue was introduced in 4.8, so all stable (4.9+) releases should probably > get this (the final solution at least), or? > Have you been able to reproduce this on mainline ? > > > Below is various ways/scripts to reproduce - sorry this is so long (and > sorry for bash in general), but thought people might be interested in them: > > note: due to the nature of the issue, the "lost" load is different each time, so > the values change each time, and sometimes/often end up at ~50/50; but my testing > shows that it keep happening almost every time: > > > Example on cgruoup v1. Often results in 60/40 load: > USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND > root 18026 63.0 0.0 3676 100 pts/7 R+ 13:09 0:06 stress --cpu 1 > root 18036 36.6 0.0 3676 100 pts/7 R+ 13:09 0:04 stress --cpu 1 When running the script below on v5.12, I'm not able to reproduce your problem > --- bash start > CGROUP_CPU=/sys/fs/cgroup/cpu/slice > CGROUP_CPUSET=/sys/fs/cgroup/cpuset/slice > CPU=0 > > function run_sandbox { > local CG_CPUSET="$1" > local CG_CPU="$2" > local CMD="$3" > > local PIPE="$(mktemp -u)" > mkfifo "$PIPE" > sh -c "read < $PIPE ; exec $CMD" & > local TASK="$!" > sleep .1 > mkdir -p "$CG_CPUSET" > mkdir -p "$CG_CPU" > tee "$CG_CPU"/cgroup.procs <<< "$TASK" > > tee "$CG_CPUSET"/cgroup.procs <<< "$TASK" > > tee "$PIPE" <<< sandox_done > rm "$PIPE" > } > > mkdir -p "$CGROUP_CPU" > mkdir -p "$CGROUP_CPUSET" > tee "$CGROUP_CPUSET"/cpuset.cpus <<< "0" > tee "$CGROUP_CPUSET"/cpuset.mems <<< "0" > > run_sandbox "$CGROUP_CPUSET" "$CGROUP_CPU/cg-1" "stress --cpu 1" > run_sandbox "$CGROUP_CPUSET" "$CGROUP_CPU/cg-2" "stress --cpu 1" > > read # click enter to cleanup > killall stress > sleep .2 > rmdir /sys/fs/cgroup/cpuset/slice/ > rmdir /sys/fs/cgroup/cpu/slice/{cg-1,cg-2,} > --- bash end > > Example on cgroup v2 with sub cgroup (same as described in commit message), > where both should get 50/50, but instead getting 99% and 1% (!!). > > USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND > root 18568 1.1 0.0 3684 100 pts/12 R+ 13:36 0:00 stress --cpu 1 > root 18580 99.3 0.0 3684 100 pts/12 R+ 13:36 0:09 stress --cpu 1 > > (in case of systemd on vg v2- make sure some slice/scope is delegated(?)/use > cpusets, otherwise systemd will fight you) > --- bash start > CGROUP_CPU=/sys/fs/cgroup/cpu/slice > CGROUP_CPUSET=/sys/fs/cgroup/cpuset/slice > CPU=0 > > function run_sandbox { > local CG_CPUSET="$1" > local CG_CPU="$2" > local CMD="$3" > > local PIPE="$(mktemp -u)" > mkfifo "$PIPE" > sh -c "read < $PIPE ; exec $CMD" & > local TASK="$!" > sleep .01 > mkdir -p "$CG_CPUSET" > mkdir -p "$CG_CPU" > tee "$CG_CPU"/cgroup.procs <<< "$TASK" > > tee "$CG_CPUSET"/cgroup.procs <<< "$TASK" > > tee "$PIPE" <<< sandox_done > rm "$PIPE" > } > > mkdir -p "$CGROUP_CPU" > mkdir -p "$CGROUP_CPUSET" > tee "$CGROUP_CPUSET"/cpuset.cpus <<< "0" > tee "$CGROUP_CPUSET"/cpuset.mems <<< "0" > > run_sandbox "$CGROUP_CPUSET" "$CGROUP_CPU/cg-1" "stress --cpu 1" > run_sandbox "$CGROUP_CPUSET" "$CGROUP_CPU/cg-2" "stress --cpu 1" > > read # click enter to cleanup > killall stress > sleep .2 > rmdir /sys/fs/cgroup/cpuset/slice/ > rmdir /sys/fs/cgroup/cpu/slice/{cg-1,cg-2,} > --- bash end > > > > For those who only want to run docker stuff: > (on cgroup v1, you must use systemd driver) > USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND > root 9291 60.1 0.0 7320 96 pts/0 R+ 13:18 0:07 /usr/bin/stress --verbose --cpu 1 > root 9388 45.5 0.0 7320 96 pts/0 R+ 13:18 0:04 /usr/bin/stress --verbose --cpu 1 > --- bash start > docker run --cpuset-cpus=0 --rm -it an-image-with-stress > docker run --cpuset-cpus=0 --rm -it an-image-with-stress > --- bash end > > > Here is a bpftrace script that can show what is happening (again. sorry for bash, > but bpftrace only allow constants as array indexes as of now, so this is the best I can do). > This script also needs bpftrace v0.13 (not released as of now, so currently you have > to compile master) or newer. > > > --- bash start > PROBE='kfunc:sched_group_set_shares{ > printf("cgroup: %s/%s\n", > str(args->tg->css.cgroup->kn->parent->name), > str(args->tg->css.cgroup->kn->name)); > printf( > "cpu load.weight avg.load_avg removed.load_avg removed.nr removed.on_list tg_load_avg_contrib tg->load_avg\n" > ); > }' > for i in $(seq 0 $(($(nproc)-1))); do > PROBE="$PROBE""$(sed "s/cpu_nr/$i/" <<<' kfunc:sched_group_set_shares{ > printf("%-4d %-12llu %-13llu %-17llu %-11d %-16d %-20llu %d\n", > cpu_nr, > (args->tg->cfs_rq[cpu_nr])->load.weight, > (args->tg->cfs_rq[cpu_nr])->avg.load_avg, > (args->tg->cfs_rq[cpu_nr])->removed.load_avg, > (args->tg->cfs_rq[cpu_nr])->removed.nr, > (args->tg->cfs_rq[cpu_nr])->on_list, > (args->tg->cfs_rq[cpu_nr])->tg_load_avg_contrib, > args->tg->load_avg.counter > ); > }')" > done > PROBE="$PROBE"'kfunc:sched_group_set_shares{ > printf("\n"); > }' > > bpftrace -e "$PROBE" > --- bash end > > > When running the bpftrace script when the sub cgroup example is running, and > executing (just setting the weight of the cgroup the same value as before, no change): > > --- bash start > tee /sys/fs/cgroup/slice/cg-1/sub/cpu.weight <<< 1 > tee /sys/fs/cgroup/slice/cg-2/sub/cpu.weight <<< 10000 > tee /sys/fs/cgroup/slice/cg-1/cpu.weight <<< 100 > tee /sys/fs/cgroup/slice/cg-2/cpu.weight <<< 100 > --- bash end > > the output is: > > --- output start > Attaching 6 probes... > cgroup: cg-1/sub > cpu load.weight avg.load_avg removed.load_avg removed.nr removed.on_list tg_load_avg_contrib tg->load_avg > 0 1048576 1023 0 0 1 1034 1662 > 1 0 0 0 0 0 0 1662 > 2 0 0 0 0 0 0 1662 > 3 0 628 628 1 0 628 1662 > > cgroup: cg-2/sub > cpu load.weight avg.load_avg removed.load_avg removed.nr removed.on_list tg_load_avg_contrib tg->load_avg > 0 1048576 1023 0 0 1 1023 1830 > 1 0 0 0 0 0 0 1830 > 2 0 0 0 0 0 0 1830 > 3 0 807 807 1 0 807 1830 > > cgroup: slice/cg-1 > cpu load.weight avg.load_avg removed.load_avg removed.nr removed.on_list tg_load_avg_contrib tg->load_avg > 0 6347 5 0 0 1 5 593 > 1 0 0 0 0 0 0 593 > 2 0 0 0 0 0 0 593 > 3 0 5 0 0 0 588 593 > > cgroup: slice/cg-2 > cpu load.weight avg.load_avg removed.load_avg removed.nr removed.on_list tg_load_avg_contrib tg->load_avg > 0 58642371 57263 0 0 1 57909 58665 > 1 0 0 0 0 0 0 58665 > 2 0 0 0 0 0 0 58665 > 3 0 75615 0 0 0 756 58665 > > --- output end > > We can clearly see that both cg-1/sub and cg-2/sub have removed.nr==1 on cpu 3, > and therefore still contribute to the tg->load_avg. Since removed.on_list==0, the load would > never be "cleaned" up unless a new task starts on that cpu, but due to the cpuset, that would not > be the case. slice/cg-1 and slice/cg-2 also have load attached to cpu 3 that isn't removed (but > the patch will properly decay the load on them as well). > > With this path, all of these examples just end up sharing cpu time in a fair 50/50 way, as expected. > > Odin Ugedal (1): > sched/fair: Fix unfairness caused by missing load decay > > kernel/sched/fair.c | 13 +++++++++++++ > 1 file changed, 13 insertions(+) > > -- > 2.31.1 >