On 2022/3/9 01:13, Tejun Heo wrote:
Hello,
On Tue, Mar 08, 2022 at 05:26:25PM +0800, Tianchen Ding wrote:
Modern platforms are growing fast in CPU count. To make better use of
CPU resources, multiple apps are starting to share the CPUs.
What we need is a way to ease contention in shared mode and make
groups as exclusive as possible, to gain both performance and
resource efficiency.
The main idea of the group balancer is to fulfill this requirement
by balancing groups of tasks among groups of CPUs; consider it a
dynamic demi-exclusive mode. A task triggers work to settle its
group into a proper partition (the one with the minimum predicted
load), then tries to migrate itself into it, so that groups are
gradually settled into the most exclusive partitions.
GB can be seen as an optimization policy built on top of load
balancing: it obeys the main idea of load balancing and makes
adjustments on top of it.
Our test on an ARM64 platform with 128 CPUs shows that the
throughput of sysbench memory improves by about 25%, and
redis-benchmark improves by up to about 10%.
The motivation makes sense to me but I'm not sure this is the right way to
architect it. We already have the framework to do all of this - the sched
domains and the load balancer. Architecturally, what the suggested patchset
is doing is building a separate load balancer on top of cpuset after using
cpuset to disable the existing load balancer, which is rather obviously
convoluted.
"the sched domains and the load balancer" you mentioned are the ways to
"balance" tasks on each domains. However, this patchset aims to "group"
them together to win hot cache and less competition, which is different
from load balancer. See commit log of the patch 3/4 and this link:
https://lore.kernel.org/all/11d4c86a-40ef-6ce5-6d08-e9d0bc9b512a@xxxxxxxxxxxxxxxxx/
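To make the contrast concrete, below is a minimal sketch (the names and
types are hypothetical, not the actual patchset code) of the settle idea
from the cover letter: each task group is steered toward the CPU
partition whose predicted load is lowest, so the group packs together
instead of being spread out by the plain load balancer.

/*
 * Minimal sketch only -- names and types are illustrative, not the
 * patchset's real API.  The group balancer predicts the load of each
 * CPU partition and settles a task group into the least-loaded one.
 */
#include <stddef.h>

struct gb_partition {
	unsigned long cpus_mask;	/* CPUs in this partition */
	unsigned long predicted_load;	/* load if the group settles here */
};

/* Return the partition a task group should try to settle into. */
static struct gb_partition *
gb_pick_partition(struct gb_partition *parts, size_t nr)
{
	struct gb_partition *best = &parts[0];

	for (size_t i = 1; i < nr; i++)
		if (parts[i].predicted_load < best->predicted_load)
			best = &parts[i];
	return best;
}

Tasks of the group then prefer CPUs of the chosen partition when they
are placed, which is what wins the hot cache.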
* AFAICS, none of what the suggested code does is all that complicated or
needs a lot of input from userspace. It should be possible to parametrize
the existing load balancer to behave better.
Group balancer mainly needs two inputs from userspace: CPU partition info
and cgroup info.
CPU partition info does need user input (and may be a bit complicated),
but the division method is free for users to choose (it can follow
NUMA nodes, clusters, caches, etc.).
Cgroup info doesn't need extra input; it's naturally configured.
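For illustration of the partition info only (this is not claiming the
patchset's actual interface), on the 128-CPU box mentioned above and
assuming, say, 4 NUMA nodes of 32 CPUs each, the partition description
could be as simple as a list of CPU ranges, one per node:

/*
 * Illustration only: one possible way for a management daemon to
 * describe CPU partitions, here following a hypothetical NUMA layout
 * of the 128-CPU test box (4 nodes x 32 CPUs).  A cluster- or
 * LLC-based split would just use different ranges.
 */
#include <stdio.h>

struct cpu_range { int first, last; };

static const struct cpu_range partitions[] = {
	{   0,  31 },	/* node 0 */
	{  32,  63 },	/* node 1 */
	{  64,  95 },	/* node 2 */
	{  96, 127 },	/* node 3 */
};

int main(void)
{
	size_t i, n = sizeof(partitions) / sizeof(partitions[0]);

	/* Emits "0-31;32-63;64-95;96-127" */
	for (i = 0; i < n; i++)
		printf("%d-%d%c", partitions[i].first, partitions[i].last,
		       i + 1 < n ? ';' : '\n');
	return 0;
}

How the kernel consumes such a description depends on the actual
interface; the point is only that the split is chosen by the user to
match the topology.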
It does parametrize the existing load balancer to behave better:
group balancer is a kind of optimization policy that obeys the basic
policy (load balance) and improves on it.
The relationship between the load balancer and the group balancer is
explained in detail at the link above.
* If, for some reason, you need more customizable behavior in terms of cpu
allocation, which is what cpuset is for, maybe it'd be better to build the
load balancer in userspace. That'd fit way better with how cgroup is used
in general, and with threaded cgroups it should fit nicely with everything
else.
We put the group balancer in kernel space because this new policy does
not depend on userspace apps; it's a "general" feature.
Doing a "dynamic cpuset" in userspace may also introduce performance
issues, since it may need to bind and unbind different cpusets several
times, and it is too strict (compared with our "soft bind").
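As a rough illustration of that userspace alternative (assuming cgroup
v2 mounted at /sys/fs/cgroup with the cpuset controller enabled; the
cgroup name is made up), each re-bind means rewriting cpuset.cpus,
which hard-migrates every task in the group:

/*
 * Userspace "dynamic cpuset" sketch: a daemon that moves a cgroup to
 * another CPU partition by rewriting its cpuset.  Every rewrite is a
 * hard bind that forces task migrations; the in-kernel "soft bind"
 * only hints the balancer instead.  Error handling trimmed.
 */
#include <stdio.h>

static int rebind_cpuset(const char *cgroup, const char *cpus)
{
	char path[256];
	FILE *f;

	snprintf(path, sizeof(path), "/sys/fs/cgroup/%s/cpuset.cpus", cgroup);
	f = fopen(path, "w");
	if (!f)
		return -1;
	fputs(cpus, f);		/* e.g. "32-63" to move the group elsewhere */
	return fclose(f);
}

int main(void)
{
	/* The daemon would call this every time it decides the group
	 * should move, paying the migration cost each time. */
	return rebind_cpuset("app1", "32-63") ? 1 : 0;
}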