Re: [PATCH 3/3 cgroup/for-5.2-fixes] cgroup: Include dying leaders with live threads in PROCS iterations

Suren Baghdasaryan <surenb@xxxxxxxxxx> · Fri, 10 Jan 2020 13:47:19 -0800

Hi Tejun,

On Tue, Jun 11, 2019 at 12:10 PM Tejun Heo <tj@xxxxxxxxxx> wrote:
>
> Hello, Michal.
>
> On Fri, Jun 07, 2019 at 07:09:53PM +0200, Michal Koutný wrote:
> > Wouldn't it make more sense to call
> >       css_set_move_task(tsk, cset, NULL, false);
> > in cgroup_release instead of cgroup_exit then?
> >
> > css_set_move_task triggers the cgroup emptiness notification so if we
> > list group leaders with running siblings as members of the cgroup (IMO
> > correct), is it consistent to deliver the (un)populated event
> > that early?
> > By moving to cgroup_release we would also make this notification
> > analogous to SIGCHLD delivery.
>
> So, the current behavior is mostly historical and I don't think this
> is a better behavior.  That said, I'm not sure about the cost benefit
> ratio of changing the behavior at this point given how long the code
> has been this way.  Another aspect is that it'd expose zombie tasks
> and possibly migrations of them to each controller.

Sorry for reviving an old discussion but I got a bug report from a
customer about Android occasionally failing to remove a cgroup after
it killed all processes in it. AndroidManagerService (a privileged
user space process) reads cgroup.procs and sends SIGKILL to all listed
processes in that cgroup until cgroup.procs is empty at which point it
tries to delete cgroup directory (it is a leaf cgroup with empty
cgroup.procs). In my testing I was able to reproduce the failure in
which case rmdir() fails with EBUSY from cgroup_destroy_locked()
because cgroup_is_populated() returns true. cgroup_is_populated()
depends on cgrp->nr_populated_xxx counters which IIUC will not be
updated until cgroup_update_populated() is called from cgroup_exit()
and that might get delayed... Meanwhile the tasks counted by
cgrp->nr_populated_xxx will not show up when cgroup.procs or
cgroup.tasks/tasks files are read by user space due to these changes:

+               /* and dying leaders w/o live member threads */
+               if (!atomic_read(&task->signal->live))
+                       goto repeat;
+       } else {
+               /* skip all dying ones */
+               if (task->flags & PF_EXITING)
+                       goto repeat;
+       }

After removing these checks cgroup.procs shows the processes up until
they are dead and the issue is gone.

So from user space's point of view the cgroup is empty and can be
removed but rmdir() fails to do so. I think this behavior is
inconsistent with the claim "cgroup which doesn't have any children
and is associated only with zombie processes is considered empty and
can be removed" from
https://elixir.bootlin.com/linux/v5.4.10/source/Documentation/admin-guide/cgroup-v2.rst#L222
.

Before contemplating what would be the right fix here I wanted to
check if cgroup.procs being empty is a reasonable indication for user
space to assume the cgroup is empty and can be removed? I think it is
but I might be wrong.
Note that this behavior was reproduced on cgroup v1 hierarchy but I
think it will happen on v2 as well. If needed I can cook up a
reproducer.
Thanks,
Suren.

>
> Thanks.

>
> --
> tejun