Re: [PATCH] sched/fair: Introduce priority load balance for CFS

Song Zhang <zhangsong34@xxxxxxxxxx> · Thu, 17 Nov 2022 17:07:37 +0800

On 2022/11/16 22:38, Vincent Guittot wrote:
On Wed, 16 Nov 2022 at 08:37, Song Zhang <zhangsong34@xxxxxxxxxx> wrote:

On 2022/11/15 15:18, Vincent Guittot wrote:
On Mon, 14 Nov 2022 at 17:42, Vincent Guittot
<vincent.guittot@xxxxxxxxxx> wrote:

On Sat, 12 Nov 2022 at 03:51, Song Zhang <zhangsong34@xxxxxxxxxx> wrote:

Hi, Vincent

On 2022/11/3 17:22, Vincent Guittot wrote:
On Thu, 3 Nov 2022 at 10:20, Song Zhang <zhangsong34@xxxxxxxxxx> wrote:

On 2022/11/3 16:33, Vincent Guittot wrote:
On Thu, 3 Nov 2022 at 04:01, Song Zhang <zhangsong34@xxxxxxxxxx> wrote:

Thanks for your reply!

On 2022/11/3 2:01, Vincent Guittot wrote:
On Wed, 2 Nov 2022 at 04:54, Song Zhang <zhangsong34@xxxxxxxxxx> wrote:

This really looks like a v3 of
https://lore.kernel.org/all/20220810015636.3865248-1-zhangsong34@xxxxxxxxxx/

Please keep versioning.

Add a new sysctl interface:
/proc/sys/kernel/sched_prio_load_balance_enabled

We don't want to add more sysctl knobs for the scheduler, we even
removed some. Knob usually means that you want to fix your use case
but the solution doesn't make sense for all cases.

OK, I will remove this knobs later.

0: default behavior
1: enable priority load balance for CFS

For co-location with idle and non-idle tasks, when CFS do load balance,
it is reasonable to prefer migrating non-idle tasks and migrating idle
tasks lastly. This will reduce the interference by SCHED_IDLE tasks
as much as possible.

I don't agree that it's always the best choice to migrate a non-idle task 1st.

CPU0 has 1 non idle task and CPU1 has 1 non idle task and hundreds of
idle task and there is an imbalance between the 2 CPUS: migrating the
non idle task from CPU1 to CPU0 is not the best choice

If the non idle task on CPU1 is running or cache hot, it cannot be
migrated and idle tasks can also be migrated from CPU1 to CPU0. So I
think it does not matter.

What I mean is that migrating non idle tasks first is not a universal
win and not always what we want.

But migrating online tasks first is mostly a trade-off that
non-idle(Latency Sensitive) tasks can obtain more CPU time and minimize
the interference caused by IDLE tasks. I think this makes sense in most
cases, or you can point out what else I need to think about it ?

Best regards.

Testcase:
- Spawn large number of idle(SCHED_IDLE) tasks occupy CPUs

What do you mean by a large number ?

- Let non-idle tasks compete with idle tasks for CPU time.

Using schbench to test non-idle tasks latency:
$ ./schbench -m 1 -t 10 -r 30 -R 200

How many CPUs do you have ?

OK, some details may not be mentioned.
My virtual machine has 8 CPUs running with a schbench process and 5000
idle tasks. The idle task is a while dead loop process below:

How can you care about latency when you start 10 workers on 8 vCPUs
with 5000 non idle threads ?

No no no... spawn 5000 idle(SCHED_IDLE) processes not 5000 non-idle
threads, and with 10 non-idle schbench workers on 8 vCPUs.

yes spawn 5000 idle tasks but my point remains the same

I am so sorry that I have not received your reply for a long time, and I
am still waiting for it anxiously. In fact, migrating non-idle tasks 1st
works well in most scenarios, so it maybe possible to add a
sched_feat(LB_PRIO) to enable or disable that. Finally, I really hope
you can give me some better advice.

I have seen that you posted a v4 5 days ago which is on my list to be reviewed.

My concern here remains that selecting non idle task 1st is not always
the best choices as for example when you have 1 non idle task per cpu
and thousands of idle tasks moving around. Then regarding your use
case, the weight of the 5000 idle threads is around twice more than
the weight of your non idle bench: sum weight of idle threads is 15k
whereas the weight of your bench is around 6k IIUC how RPS run. This
also means that the idle threads will take a significant times of the
system: 5000 / 7000 ticks. I don't understand how you can care about
latency in such extreme case and I'm interested to get the real use
case where you can have such situation.

All that to say that idle task remains cfs task with a small but not
null weight and we should not make them special other than by not
preempting at wakeup.

Also, as mentioned for a previous version, a task with nice prio 19
has a weight of 15 so if you replace the 5k idle threads with 1k cfs
w/ nice prio 19 threads, you will face a similar problem. So you can't
really care only on the idle property of a task

Well, my original idea was to consider interference between tasks of
different priorities when doing CFS load balancing to ensure that
non-idle tasks get more CPU scheduler time without changing the native
CFS load balancing policy.

Consider a simple scenario. Assume that CPU 0 has two non-idle tasks
whose weight is 1024 * 2 = 2048, also CPU 0 has 1000 idle tasks whose
weight is 1K x 15 = 15K. CPU 1 is idle. Therefore, IDLE load balance is

weight of cfs idle thread is 3, the weight of cfs nice 19 thread is 15

yes, idle weight is 3, thanks for your pointing out.

triggered. CPU 1 needs to pull a certain number of tasks from CPU 0. If
we do not considerate task priorities and interference between tasks,
more than 600 idle tasks on CPU 0 may be migrated to CPU 1. As a result,
two non-idle tasks still compete on CPU 0. However, CPU 1 is running
with all idle but not non-idle tasks.

Let's calculate the percentage of CPU time gained by non-idle tasks in a
scheduling period:

CPU 1: time_percent(non-idle tasks) = 0
CPU 0: time_percent(non-idle tasks) = 2048 * 2 / (2048 + 15000) = 24%

2 cfs task nice 0 with 1000 cfs idle tasks on 2 CPUs. The weight of
the system is:

2*1024 + 1000*3 = 5048 or  2524 per CPU

This means that the cfs nice 0 task should get 1024/(5048) = 20% of
system time which means 40% of CPUs time.

This also means that the 2 cfs tasks on CPU0 is a valid configuration
as they will both have their 40% of CPUs

If you increase idle task number to 3000, the cfs nice 0 task only get 
1024 / (2 * 1024 + 3000 * 3) = 9.3% of system time.

But if we can first migrate one cfs nice 0 task to CPU 1, the cfs nice 0 
task maybe execute quickly on CPU 1, then CPU 1 is got to idle and pulls 
more idle tasks from CPU 0, so that the cfs nice 0 task on CPU 0 can 
also be completed more quickly.

cfs idle threads have a small weight to be negligible compared to
"normal" threads so they can't normally balance a system by themself
but by spawning 1000+ cfs idle threads, you make them not negligible
anymore. That's the root of your problem. A CPU with only cfs idle
tasks should be seen unbalanced compared to other CPUs with non idle
tasks and this is what is happening with small/normal number of cfs
idle threads

If we do not consider putting all low-priority tasks to a cgroup with a 
minimum cpu shares and only set per-task scheduler policy to SCHED_IDLE, 
the weight of a large number of idle tasks cannot be ignored.

On the other hand, if we consider the interference between different
task priorities, we change the migration policy to firstly migrate an
non-idle task on CPU 0 to CPU 1. Migrating idle tasks on CPU 0 maybe
interfered with the non-idle task on CPU 1. So we decide to migrate idle
tasks on CPU 0 after non-idle tasks on CPU 1 are completed or exited.

Now the percentage of the CPU time obtained by the non-idle tasks in a
scheduling period is as follows:

CPU 1: time_percent(non-idle tasks) = 1024 / 1024 = 100%
CPU 0: time_percent(non-idle tasks) = 1024 / (1024 + 15000) = 6.4%

But this is unfair for one cfs nice 0 thread and all cfs idle threads

This unfairness may be short-lived, because as soon as CPU 1 go to idle 
again, CPU 1 immediately pulls more idle tasks from CPU 0 to accelerate 
the running of non-idle tasks on CPU 0.

Obviously, if load balance migration tasks prefer migrate non-idle tasks
and suppress the interference of idle tasks migration on non-idle tasks,
the latency of non-idle tasks can be significantly reduced. Although
this will cause some idle tasks imbalance between different CPUs and
reduce throughput of idle tasks., I think this strategy is feasible in
some real-time business scenarios for latency tasks.

But idle cfs ask remains cfs task and we keep cfs fairness for all threads

Have you tried to :
- Increase nice priority of the non idle cfs task so the sum of the
weight of idle tasks remain a small portion of the total weight ?
- to put your thousands idle tasks in a cgroup and set cpu.idle for
this cgroup. This should also ensure that the weight of idle threads
remains negligible compared to others.

I have tried both setup in my local system and I have 1 non idle task per CPU

Regards,
Vincent

yes I have tried to do them and the results are as expected.

But...as you mentioned above, if all idle tasks are placed in a cgroup 
with the minimum cpu shares or increase nice priority of non-idle tasks, 
the weight of idle tasks is negligible compared with that of non-idle 
tasks, this does not affect the final result.

However, if we try only consider changing the scheduler policy of idle 
tasks to SCHED_IDLE and do not want to modify nice priority of non-idle 
tasks, the weight of idle tasks and the interference on non-idle tasks 
needs to be reconsidered when tasks migration between CPUs.

Best Regards,
Song Zhang

Best regards.

Song Zhang
.
.