Hi,

> I wanted to say one v5.12-rcX version to make sure this is still a
> valid problem on latest version

Ahh, I see. No problem. :) Thank you so much for taking the time to
look at this!

> I confirm that I can see a ratio of 4ms vs 204ms running time with the
> patch below.

(I assume you are talking about the bash code for reproducing, not the
actual sched patch.)

> But when I look more deeply in my trace (I have
> instrumented the code), it seems that the 2 stress-ng don't belong to
> the same cgroup but remained in cg-1 and cg-2 which explains such
> running time difference.

(My second reply to your previous mail might also help clarify this.)

I am not sure if I stated it clearly, or if we are talking about the
same thing. It _is_ the intention that the two procs should not be in
the same cgroup. In the same way as people create "containers", each
proc runs in a separate cgroup in the example. The issue is not the
balancing between the procs themselves, but rather between the
cgroups/sched_entities inside the cgroup hierarchy (due to the fact
that the vruntime of those sched_entities ends up being calculated
with more load than they are supposed to have). If you have any
thoughts about the phrasing of the patch itself to make it easier to
understand, feel free to suggest changes.

Given the last cgroup v1 script, I get this:

- cat /proc/<stress-pid-1>/cgroup | grep cpu
11:cpu,cpuacct:/slice/cg-1/sub
3:cpuset:/slice

- cat /proc/<stress-pid-2>/cgroup | grep cpu
11:cpu,cpuacct:/slice/cg-2/sub
3:cpuset:/slice

The cgroup hierarchy will then roughly look like this (using cgroup v2
terms, because I find them easier to reason about):

slice/
  cg-1/
    cpu.weight: 100
    sub/
      cpu.weight: 1
      cpuset.cpus: 1
      cgroup.procs - stress process 1 here
  cg-2/
    cpu.weight: 100
    sub/
      cpu.weight: 10000
      cpuset.cpus: 1
      cgroup.procs - stress process 2 here

This should result in a 50/50 split, since cg-1 and cg-2 both have a
weight of 100 and both "live" inside the /slice cgroup. The inner
weight should not matter, since there is only one cgroup at that level.

> So your script doesn't reproduce the bug you
> want to highlight. That being said, I can also see a diff between the
> contrib of the cpu0 in the tg_load. I'm going to look further

There can definitely be some other issues involved, and I am pretty
sure you have way more knowledge about the scheduler than me... :)
However, I am pretty sure that it does in fact show the issue I am
talking about, and applying the patch does indeed make it impossible
to reproduce on my systems.

Odin
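
P.S. For completeness, here is a rough sketch of how the same
hierarchy could be set up directly with cgroup v2. This is not the v1
script from the thread; the /sys/fs/cgroup mount point, the controller
delegation steps and the stress-ng invocation below are my assumptions
(run as root, and CPU 1 must exist):

# Sketch only: cgroup v2 setup mirroring the hierarchy above.
CG=/sys/fs/cgroup/slice
mkdir -p "$CG/cg-1/sub" "$CG/cg-2/sub"

# Make the cpu and cpuset controllers available down the tree.
echo "+cpu +cpuset" > /sys/fs/cgroup/cgroup.subtree_control
echo "+cpu +cpuset" > "$CG/cgroup.subtree_control"
echo "+cpu +cpuset" > "$CG/cg-1/cgroup.subtree_control"
echo "+cpu +cpuset" > "$CG/cg-2/cgroup.subtree_control"

# Equal outer weights -> the two "containers" should split the CPU 50/50 ...
echo 100   > "$CG/cg-1/cpu.weight"
echo 100   > "$CG/cg-2/cpu.weight"
# ... regardless of the inner weights, since each sub/ is the only cgroup
# at its level.
echo 1     > "$CG/cg-1/sub/cpu.weight"
echo 10000 > "$CG/cg-2/sub/cpu.weight"

# Pin both leaves to the same CPU so the two processes actually compete.
echo 1 > "$CG/cg-1/sub/cpuset.cpus"
echo 1 > "$CG/cg-2/sub/cpuset.cpus"

# One CPU hog per leaf; move the subshell into the cgroup before exec so
# the stress-ng workers inherit it.
( echo $BASHPID > "$CG/cg-1/sub/cgroup.procs"; exec stress-ng --cpu 1 --timeout 30s ) &
( echo $BASHPID > "$CG/cg-2/sub/cgroup.procs"; exec stress-ng --cpu 1 --timeout 30s ) &
wait

Afterwards one would expect usage_usec in cg-1/sub/cpu.stat and
cg-2/sub/cpu.stat to be roughly equal on a fixed kernel, and heavily
skewed on an affected one.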