On Tue, Apr 26, 2022 at 10:34:21PM -0400, Waiman Long wrote:
> On 4/26/22 21:06, Feng Tang wrote:
> > On Tue, Apr 26, 2022 at 10:58:21PM +0800, Waiman Long wrote:
> > > On 4/25/22 23:23, Feng Tang wrote:
> > > > Hi Waiman,
> > > >
> > > > On Mon, Apr 25, 2022 at 11:55:05AM -0400, Waiman Long wrote:
> > > > > There are 3 places where the cpu and node masks of the top cpuset can
> > > > > be initialized in the order they are executed:
> > > > >  1) start_kernel -> cpuset_init()
> > > > >  2) start_kernel -> cgroup_init() -> cpuset_bind()
> > > > >  3) kernel_init_freeable() -> do_basic_setup() -> cpuset_init_smp()
> > > > >
> > > > > The first one, cpuset_init(), just sets all the bits in the masks.
> > > > > The last one executed is cpuset_init_smp(), which sets up cpu and
> > > > > node masks suitable for v1 but not v2. cpuset_bind() does the right
> > > > > setup for both v1 and v2.
> > > > >
> > > > > For systems with a cgroup v2 setup, cpuset_bind() is called once.
> > > > > For systems with a cgroup v1 setup, cpuset_bind() is called twice.
> > > > > It is first called before cpuset_init_smp() in cgroup v2 mode. Then
> > > > > it is called again in v1 mode when the cgroup v1 filesystem is
> > > > > mounted, after cpuset_init_smp():
> > > > >
> > > > >   [    2.609781] cpuset_bind() called - v2 = 1
> > > > >   [    3.079473] cpuset_init_smp() called
> > > > >   [    7.103710] cpuset_bind() called - v2 = 0
> > > >
> > > > I ran some tests. On a server with CentOS, cpuset_bind() was indeed
> > > > called twice, first as v2 during kernel boot and then as v1 post-boot.
> > > >
> > > > However, on QEMU running a basic Debian rootfs image, the second call
> > > > of cpuset_bind() didn't happen.
> > >
> > > The first time cpuset_bind() is called, in cgroup_init(), the kernel
> > > doesn't know whether userspace is going to mount the v1 or v2 cgroup
> > > filesystem. By default, it is assumed to be v2. However, if userspace
> > > mounts the cgroup v1 filesystem for cpuset, cpuset_bind() will be run
> > > at that point by rebind_subsystems() to set up the cgroup v1
> > > environment, and cpus_allowed/mems_allowed will be correctly set.
> > > Mounting the cgroup v2 filesystem, however, does not cause
> > > rebind_subsystems() to run, so cpuset_bind() is not called again.
> > >
> > > Is the QEMU setup not mounting any cgroup filesystem at all? If so,
> > > does it matter whether the v1 or v2 setup is used?
> >
> > When I got the cpuset binding error report, I first tried to reproduce
> > it on QEMU and failed (since there was no memory hotplug), then I
> > reproduced it on a real server. On both systems, I used the
> > "cgroup_no_v1=all" cmdline parameter to test cgroup v2; could this be
> > the reason? (TBH, this is the first time I have used cgroup v2.)
> >
> > Here is the info dump:
> >
> >   # mount | grep cgroup
> >   tmpfs on /sys/fs/cgroup type tmpfs (ro,nosuid,nodev,noexec,mode=755)
> >   cgroup on /sys/fs/cgroup/systemd type cgroup (rw,nosuid,nodev,noexec,relatime,xattr,release_agent=/lib/systemd/systemd-cgroups-agent,name=systemd)
> >
> >   # cat /proc/filesystems | grep cgroup
> >   nodev	cgroup
> >   nodev	cgroup2
> >
> > Thanks,
> > Feng
>
> For cgroup v2, cpus_allowed should be set to cpu_possible_mask and
> mems_allowed to node_possible_map, as is done in the first invocation of
> cpuset_bind(). That is the correct behavior.

OK. As for the cgroup v2 memory binding problem with hot-added nodes, I
retested today and it can't be reproduced with this patch applied.
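For reference, the v1/v2 distinction being discussed is the branch inside
cpuset_bind(). A simplified sketch (locking elided; details may differ
slightly from the current kernel/cgroup/cpuset.c):

static void cpuset_bind(struct cgroup_subsys_state *root_css)
{
	if (is_in_v2_mode()) {
		/* v2 (the default at early boot): allow all possible CPUs/nodes */
		cpumask_copy(top_cpuset.cpus_allowed, cpu_possible_mask);
		top_cpuset.mems_allowed = node_possible_map;
	} else {
		/* v1: fall back to the currently effective (online) masks */
		cpumask_copy(top_cpuset.cpus_allowed, top_cpuset.effective_cpus);
		top_cpuset.mems_allowed = top_cpuset.effective_mems;
	}
}

With that in mind, the boot log above makes sense: the first call runs in
v2 mode (v2 = 1) because v2 is the default before any cgroup filesystem is
mounted, and the second call runs in v1 mode (v2 = 0) once
rebind_subsystems() fires on the v1 mount.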
So feel free to add:

Tested-by: Feng Tang <feng.tang@xxxxxxxxx>

Thanks,
Feng

> Cheers,
> Longman
>