On 7/29/22 10:50, Waiman Long wrote:
> On 7/29/22 10:15, Valentin Schneider wrote:
>> On 28/07/22 11:39, Tejun Heo wrote:
>>> Hello, Waiman.
>>>
>>> On Thu, Jul 28, 2022 at 05:04:19PM -0400, Waiman Long wrote:
>>>>> So, the patch you proposed is making the code remember one special
>>>>> aspect of user requested configuration - whether it configured it
>>>>> or not, and trying to preserve that particular state as cpuset
>>>>> state changes. It addresses the immediate problem but it is a very
>>>>> partial approach. Let's say a task wanna be affined to one logical
>>>>> thread of each core and set its mask to 0x5555. Now, let's say
>>>>> cpuset got enabled and enforced 0xff and affined the task to 0xff.
>>>>> After a while, the cgroup got more cpus allocated and its cpuset
>>>>> now has 0xfff. Ideally, what should happen is the task now having
>>>>> the effective mask of 0x555. In practice, tho, it either would get
>>>>> 0xf55 or 0x55 depending on which way we decide to misbehave.
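
Just to spell out the arithmetic in Tejun's example above, here is a
tiny userspace illustration (mask values are the hypothetical ones from
the example, not kernel code):

#include <stdio.h>

int main(void)
{
	unsigned int requested = 0x5555;  /* one thread per core, as requested */
	unsigned int cpuset_v1 = 0x00ff;  /* cpuset initially allows 8 cpus    */
	unsigned int cpuset_v2 = 0x0fff;  /* cpuset later grows to 12 cpus     */

	/* Effective affinity = user request intersected with the cpuset. */
	printf("effective under 0x%03x: 0x%03x\n", cpuset_v1, requested & cpuset_v1);
	printf("effective under 0x%03x: 0x%03x\n", cpuset_v2, requested & cpuset_v2);
	return 0;
}

This prints 0x055 and then 0x555, i.e. the "ideal" result above.
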
>>>> OK, I see what you want to accomplish. To fully address this issue,
>>>> we will need a new cpumask variable in the task structure which
>>>> will be allocated if sched_setaffinity() is ever called. I can
>>>> rework my patch to use this approach.

>>> Yeah, we'd need to track what user requested separately from the
>>> currently effective cpumask. Let's make sure that the scheduler
>>> folks are on board before committing to the idea tho. Peter, Ingo,
>>> what do you guys think?

>> FWIW, on the runtime overhead side of things I think it'll be OK, as
>> that should be just an extra mask copy in sched_setaffinity() and a
>> subset check / cpumask_and() in set_cpus_allowed_ptr(). The policy
>> side is a bit less clear (when, if ever, do we clear the user-defined
>> mask? Will it keep haunting us even after moving a task to a disjoint
>> cpuset partition?).
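
Roughly what I have in mind for the update side, as a sketch only (the
helper name is made up; locking and corner cases are ignored):

#include <linux/cpumask.h>
#include <linux/sched.h>

/*
 * Sketch of the extra step being discussed (not a real patch): when the
 * allowed cpus of a task are rewritten by a cpuset change or hotplug,
 * prefer the intersection of the remembered user request with the newly
 * allowed set, and fall back to the full set if they don't intersect.
 */
static void restrict_to_user_request(struct task_struct *p,
				     const struct cpumask *new_allowed,
				     struct cpumask *result)
{
	if (p->user_cpus_ptr &&
	    cpumask_intersects(p->user_cpus_ptr, new_allowed))
		cpumask_and(result, p->user_cpus_ptr, new_allowed);
	else
		cpumask_copy(result, new_allowed);
}
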
> The runtime overhead should be minimal. It is the behavioral side that
> we should be careful about. It is a change in existing behavior and we
> don't want to surprise users. Currently, a task that sets its cpu
> affinity explicitly will have its affinity reset whenever there is any
> change to the cpuset it belongs to or a hotplug event touches any cpu
> in the current cpuset. The new behavior we are proposing here is to
> try our best to keep the cpu affinity that the user requested within
> the constraints of the current cpuset as well as the cpu hotplug
> state.

>> There's also the question of if/how that new mask should be exposed,
>> because attaching a task to a cpuset will now yield a
>> not-necessarily-obvious affinity - e.g. in the thread affinity
>> example above, if the initial affinity setting was done ages ago by
>> some system tool, IMO the user needs a way to be able to
>> expect/understand the result of 0x555 rather than 0xfff.
> Users can use sched_getaffinity(2) to retrieve the current cpu
> affinity. It is up to users to set another one if they don't like the
> current one. I don't think we need to return what the previously
> requested cpu affinity is. They are supposed to know that, or they can
> set their own if they don't like it.
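
For example, something along these lines from userspace (arbitrary cpu
numbers, error handling trimmed):

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

int main(void)
{
	cpu_set_t set;

	/* Ask for cpus 0 and 2 only. */
	CPU_ZERO(&set);
	CPU_SET(0, &set);
	CPU_SET(2, &set);
	if (sched_setaffinity(0, sizeof(set), &set))
		perror("sched_setaffinity");

	/* Read back whatever the kernel actually granted. */
	if (!sched_getaffinity(0, sizeof(set), &set))
		printf("cpu0: %d cpu2: %d\n",
		       CPU_ISSET(0, &set), CPU_ISSET(2, &set));
	return 0;
}
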
Looking at Will's series that introduced user_cpus_ptr, I think we can
overlay our proposal on top of that, so that calling sched_setaffinity()
will also update user_cpus_ptr. We may still need a flag to indicate
whether user_cpus_ptr was set up because of sched_setaffinity() or due
to a call to force_compatible_cpus_allowed_ptr() from the arm64 arch
code. That will make our work easier as some of the infrastructure is
already there. I am looking forward to your feedback.
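
Roughly, the sched_setaffinity() side could then look something like the
sketch below. This is only an illustration of the idea, not a patch: the
flag field is invented and locking/error handling is elided.

#include <linux/cpumask.h>
#include <linux/sched.h>
#include <linux/slab.h>

/*
 * Sketch only: remember the mask passed to sched_setaffinity() in
 * user_cpus_ptr and tag it as a user request, so later cpuset/hotplug
 * updates can tell it apart from a mask saved by
 * force_compatible_cpus_allowed_ptr() on arm64.
 */
static int save_user_requested_mask(struct task_struct *p,
				    const struct cpumask *mask)
{
	struct cpumask *user_mask;

	user_mask = kmalloc(cpumask_size(), GFP_KERNEL);
	if (!user_mask)
		return -ENOMEM;

	cpumask_copy(user_mask, mask);
	kfree(p->user_cpus_ptr);		/* drop any previous copy */
	p->user_cpus_ptr = user_mask;
	p->user_affinity_set = true;		/* invented flag field */
	return 0;
}
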
Thanks,
Longman