Re: [PATCH 1/2] cpuset: Fix cpuset_cpus_allowed() to not filter offline CPUs

Waiman Long <longman@xxxxxxxxxx> · Wed, 1 Feb 2023 14:14:35 -0500

On 2/1/23 13:46, Waiman Long wrote:
On 2/1/23 10:16, Waiman Long wrote:
On 2/1/23 04:14, Peter Zijlstra wrote:
On Tue, Jan 31, 2023 at 11:14:27PM -0500, Waiman Long wrote:
On 1/31/23 17:17, Will Deacon wrote:
From: Peter Zijlstra <peterz@xxxxxxxxxxxxx>

There is a difference in behaviour between CPUSET={y,n} that is now
wrecking havoc with {relax,force}_compatible_cpus_allowed_ptr().

Specifically, since commit 8f9ea86fdf99 ("sched: Always preserve the
user requested cpumask") relax_compatible_cpus_allowed_ptr() is
calling __sched_setaffinity() unconditionally.

But the underlying problem goes back a lot further, possibly to
commit: ae1c802382f7 ("cpuset: apply cs->effective_{cpus,mems}") 
which
switched cpuset_cpus_allowed() from cs->cpus_allowed to
cs->effective_cpus.

The problem is that for CPUSET=y cpuset_cpus_allowed() will filter 
out
all offline CPUs. For tasks that are part of a (!root) cpuset this is
then later fixed up by the cpuset hotplug notifiers that re-evaluate
and re-apply cs->effective_cpus, but for (normal) tasks in the root
cpuset this does not happen and they will forever after be excluded
from CPUs onlined later.

As such, rewrite cpuset_cpus_allowed() to return a wider mask,
including the offline CPUs.

Fixes: 8f9ea86fdf99 ("sched: Always preserve the user requested 
cpumask")
Reported-by: Will Deacon <will@xxxxxxxxxx>
Signed-off-by: Peter Zijlstra (Intel) <peterz@xxxxxxxxxxxxx>
Link: 
https://lkml.kernel.org/r/20230117160825.GA17756@willie-the-truck
Signed-off-by: Will Deacon <will@xxxxxxxxxx>
Before cgroup v2, cpuset had only one cpumask - cpus_allowed. It only
tracked online cpus and ignored the offline ones. It behaves more like
effective_cpus in cpuset v2. With v2, we have 2 cpumasks - 
cpus_allowed and
effective_cpus. When cpuset v1 is mounted, cpus_allowed and 
effective_cpus
are effectively the same and track online cpus. With cpuset v2, 
cpus_allowed
contains what the user has written into and it won't be changed until
another write happen. However, what the user written may not be 
what the
system can give it and effective_cpus is what the system decides a 
cpuset
can use.

Cpuset v2 is able to handle hotplug correctly and update the task's 
cpumask
accordingly. So missing previously offline cpus won't happen with v2.

Since v1 keeps the old behavior, previously offlined cpus are lost 
in the
cpuset's cpus_allowed. However tasks in the root cpuset will still 
be fine
with cpu hotplug as its cpus_allowed should track cpu_online_mask. 
IOW, only
tasks in a non-root cpuset suffer this problem.

It was a known issue in v1 and I believe is one of the major 
reasons of the
cpuset v2 redesign.

A major concern I have is the overhead of creating a poor man 
version of v2
cpus_allowed. This issue can be worked around even for cpuset v1 if 
it is
mounted with the cpuset_v2_mode option to behave more like v2 in 
its cpumask
handling. Alternatively we may be able to provide a config option 
to make
this the default for v1 without the special mount option, if 
necessary.
You're still not getting it -- even cpuset (be it v1 or v2) *MUST* 
*NOT*
mask offline cpus for root cgroup tasks, ever. (And the only reason it
gets away with masking offline for !root is that it re-applies the mask
every time it changes.)

Yes it did that for a fair while -- but it is wrong and broken and a
very big behavioural difference between CONFIG_CPUSET={y,n}. This must
not be.

Arguably cpuset-v2 is still wrong for masking offline cpus in it's
effective_cpus mask, but I really didn't want to go rewrite cpuset.c 
for
something that needs to go into /urgent *now*.

Hence this minimal patch that at least lets sched_setaffinity() work as
intended.

I don't object to the general idea of keeping offline cpus in a 
task's cpu affinity. In the case of cpu offline event, we can skip 
removing that offline cpu from the task's cpu affinity. That will 
partially solve the problem here and is also simpler.

I believe a main reason why effective_cpus holds only online cpus is 
because of the need to detect when there is no online cpus available 
in a given cpuset. In this case, it will fall back to the nearest 
ancestors with online cpus.

This offline cpu problem with cpuset v1 is a known problem for a long 
time. It is not a recent regression.

Note that using cpus_allowed directly in cgroup v2 may not be right 
because cpus_allowed may have no relationship to effective_cpus at all 
in some cases, e.g.

   root
    |
    V
    A (cpus_allowed = 1-4, effective_cpus = 1-4)
    |
    V
    B (cpus_allowed = 5-8, effective_cpus = 1-4)

In the case of cpuset B, passing back cpus 5-8 as the allowed_cpus is 
wrong.

I wonder how often is cpu hotplug happening in those arm64 cpu systems 
that only have a subset of cpus that can run 32-bit programs.

One possible solution is to use cpuset_cpus_allowed_fallback() in case 
none of the cpus in the current cpuset is allowed to be used to run a 
given task.

Cheers,
Longman