On 2/3/23 06:50, Will Deacon wrote:
On Wed, Feb 01, 2023 at 10:34:00PM -0500, Waiman Long wrote:
On 2/1/23 16:10, Peter Zijlstra wrote:
On Wed, Feb 01, 2023 at 01:46:11PM -0500, Waiman Long wrote:
Note that using cpus_allowed directly in cgroup v2 may not be right because
cpus_allowed may have no relationship to effective_cpus at all in some
cases, e.g.
root
|
V
A (cpus_allowed = 1-4, effective_cpus = 1-4)
|
V
B (cpus_allowed = 5-8, effective_cpus = 1-4)
In the case of cpuset B, passing back cpus 5-8 as the allowed_cpus is wrong.
I think my patch as written does the right thing here. Since the
intersection of (1-4) and (5-8) is empty it will move up the hierarchy
and we'll end up with (1-4) from the cgroup side of things.
So the purpose of __cs_cpus_allowed() is to override the cpus_allowed of
the root set and force it to cpu_possible_mask.
Then cs_cpus_allowed() computes the intersection of cs->cpus_allowed and
all its parents. This will, in the case of B above, result in the empty
mask.
Then cpuset_cpus_allowed() has a loop that starts with
task_cpu_possible_mask(), intersects that with cs_cpus_allowed() and if
the intersection of that and cpu_online_mask is empty, moves up the
hierarchy. Given cs_cpus_allowed(B) is the empty mask, we'll move to A.
Note that since we force the mask of root to cpu_possible_mask,
cs_cpus_allowed(root) will be a no-op and if we guarantee (in arch code)
that cpu_online_mask always has a non-empty intersection with
task_cpu_possible_mask(), this loop is guaranteed to terminate with a
viable mask.
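
As a toy model of the walk described above, with Python sets standing in for cpumasks. This is only a sketch, not the kernel code: the function names mirror the ones in this thread, but the Cpuset class, the argument lists, and the choice to return the mask without intersecting it with the online CPUs are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Optional, Set


@dataclass
class Cpuset:
    """Illustrative stand-in for struct cpuset (hypothetical, not kernel API)."""
    name: str
    cpus_allowed: Set[int]
    parent: Optional["Cpuset"] = None


def cs_cpus_allowed(cs: Cpuset, possible: Set[int]) -> Set[int]:
    """Intersect cs.cpus_allowed with all of its ancestors.

    The root is treated as `possible`, modelling the __cs_cpus_allowed()
    override, so the root contributes nothing to the intersection.
    """
    mask = set(possible)
    node: Optional[Cpuset] = cs
    while node is not None:
        own = possible if node.parent is None else node.cpus_allowed
        mask &= own
        node = node.parent
    return mask


def cpuset_cpus_allowed(cs: Cpuset, task_possible: Set[int],
                        possible: Set[int], online: Set[int]) -> Set[int]:
    """Walk up the hierarchy until the mask contains an online CPU."""
    node: Optional[Cpuset] = cs
    while node is not None:
        mask = task_possible & cs_cpus_allowed(node, possible)
        if mask & online:
            # Assumption: return the full mask, offline CPUs included.
            return mask
        node = node.parent
    # Only reachable if task_possible and online do not intersect,
    # which arch code is expected to rule out.
    raise RuntimeError("no online CPU in task_possible")


# The hierarchy from the example earlier in the thread.
POSSIBLE = set(range(0, 9))
root = Cpuset("root", set())
a = Cpuset("A", {1, 2, 3, 4}, parent=root)
b = Cpuset("B", {5, 6, 7, 8}, parent=a)
```

With this model, cs_cpus_allowed(b, POSSIBLE) is the empty set, so cpuset_cpus_allowed() for a task in B moves up to A and yields CPUs 1-4.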
I will take a closer look at that tomorrow. I will be more comfortable
ack'ing that if this is specific to v1 cpuset instead of applying this in
both v1 and v2 since it is only v1 that is problematic.
fwiw, the regression I'm seeing is with cgroup2. I haven't tried v1.
I think I know where the problem is. It is due to the fact that the cpuset
hotplug code doesn't update the cpumasks of the tasks in the top cpuset
(root) at all when there is a cpu offline or online event. It is
probably because for some of the tasks in the top cpuset, especially the
percpu kthreads, changing their cpumasks can be catastrophic. The hotplug
code does update the cpumasks of the tasks that are not in the top
cpuset. This problem is irrespective of whether v1 or v2 is in use.
The partition code does try to update the cpumasks of the tasks in the
top cpuset, but skips the percpu kthreads. My testing shows that this is
probably OK. For safety, I agree that it is better to extend the allowed
cpu list to all the possible cpus (including offline ones) for now, until
more testing is done to find a safe way to update those cpumasks on
hotplug. That special case should apply only to tasks in the top cpuset.
For the rest, the current code should be OK and is less risky than
adopting the code in this patch.
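
To make the proposed split concrete, here is a rough sketch using Python sets as cpumasks. It is not kernel code: allowed_cpus() is a hypothetical helper, and modelling the "current code" path as the cpuset's effective cpus is my assumption for illustration.

```python
from typing import Set


def allowed_cpus(in_top_cpuset: bool, task_possible: Set[int],
                 cpu_possible: Set[int], effective_cpus: Set[int]) -> Set[int]:
    """Hypothetical helper: pick the allowed mask for one task."""
    if in_top_cpuset:
        # Special case: all possible cpus, offline ones included, because
        # hotplug does not refresh cpumasks of tasks in the top cpuset.
        base = cpu_possible
    else:
        # Assumed model of the existing behaviour for everything else.
        base = effective_cpus
    return task_possible & base


cpu_possible = {0, 1, 2, 3}
effective_after_offline = {0, 1, 2}  # CPU 3 has gone offline

top_task = allowed_cpus(True, cpu_possible, cpu_possible,
                        effective_after_offline)
child_task = allowed_cpus(False, cpu_possible, cpu_possible,
                          effective_after_offline)
```

Here top_task keeps CPU 3 in its mask, so it can run there again once the CPU comes back online, while child_task relies on the hotplug code to update its mask.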
Cheers,
Longman