Hi Tejun, On Tue, Jun 11, 2019 at 12:10 PM Tejun Heo <tj@xxxxxxxxxx> wrote: > > Hello, Michal. > > On Fri, Jun 07, 2019 at 07:09:53PM +0200, Michal Koutný wrote: > > Wouldn't it make more sense to call > > css_set_move_task(tsk, cset, NULL, false); > > in cgroup_release instead of cgroup_exit then? > > > > css_set_move_task triggers the cgroup emptiness notification so if we > > list group leaders with running siblings as members of the cgroup (IMO > > correct), is it consistent to deliver the (un)populated event > > that early? > > By moving to cgroup_release we would also make this notification > > analogous to SIGCHLD delivery. > > So, the current behavior is mostly historical and I don't think this > is a better behavior. That said, I'm not sure about the cost benefit > ratio of changing the behavior at this point given how long the code > has been this way. Another aspect is that it'd expose zombie tasks > and possibly migrations of them to each controller. Sorry for reviving an old discussion but I got a bug report from a customer about Android occasionally failing to remove a cgroup after it killed all processes in it. AndroidManagerService (a privileged user space process) reads cgroup.procs and sends SIGKILL to all listed processes in that cgroup until cgroup.procs is empty at which point it tries to delete cgroup directory (it is a leaf cgroup with empty cgroup.procs). In my testing I was able to reproduce the failure in which case rmdir() fails with EBUSY from cgroup_destroy_locked() because cgroup_is_populated() returns true. cgroup_is_populated() depends on cgrp->nr_populated_xxx counters which IIUC will not be updated until cgroup_update_populated() is called from cgroup_exit() and that might get delayed... Meanwhile the tasks counted by cgrp->nr_populated_xxx will not show up when cgroup.procs or cgroup.tasks/tasks files are read by user space due to these changes: + /* and dying leaders w/o live member threads */ + if (!atomic_read(&task->signal->live)) + goto repeat; + } else { + /* skip all dying ones */ + if (task->flags & PF_EXITING) + goto repeat; + } After removing these checks cgroup.procs shows the processes up until they are dead and the issue is gone. So from user space's point of view the cgroup is empty and can be removed but rmdir() fails to do so. I think this behavior is inconsistent with the claim "cgroup which doesn't have any children and is associated only with zombie processes is considered empty and can be removed" from https://elixir.bootlin.com/linux/v5.4.10/source/Documentation/admin-guide/cgroup-v2.rst#L222 . Before contemplating what would be the right fix here I wanted to check if cgroup.procs being empty is a reasonable indication for user space to assume the cgroup is empty and can be removed? I think it is but I might be wrong. Note that this behavior was reproduced on cgroup v1 hierarchy but I think it will happen on v2 as well. If needed I can cook up a reproducer. Thanks, Suren. > > Thanks. > > -- > tejun