Re: [PATCH 3/3 cgroup/for-5.2-fixes] cgroup: Include dying leaders with live threads in PROCS iterations

Suren Baghdasaryan <surenb@xxxxxxxxxx> · Fri, 10 Jan 2020 16:55:27 -0800

On Fri, Jan 10, 2020 at 4:00 PM Suren Baghdasaryan <surenb@xxxxxxxxxx> wrote:
>
>  H Michal,
>
> On Fri, Jan 10, 2020 at 3:16 PM Michal Koutný <mkoutny@xxxxxxxx> wrote:
> >
> > Hello Suren.
> >
> > On Fri, Jan 10, 2020 at 01:47:19PM -0800, Suren Baghdasaryan <surenb@xxxxxxxxxx> wrote:
> > > > On Fri, Jun 07, 2019 at 07:09:53PM +0200, Michal Koutný wrote:
> > > > > Wouldn't it make more sense to call
> > > > >       css_set_move_task(tsk, cset, NULL, false);
> > > > > in cgroup_release instead of cgroup_exit then?
> > > > >
> > > > > css_set_move_task triggers the cgroup emptiness notification so if we
> > > > > list group leaders with running siblings as members of the cgroup (IMO
> > > > > correct), is it consistent to deliver the (un)populated event
> > > > > that early?
> > > > > By moving to cgroup_release we would also make this notification
> > > > > analogous to SIGCHLD delivery.
> > > >
> > > > So, the current behavior is mostly historical and I don't think this
> > > > is a better behavior.  That said, I'm not sure about the cost benefit
> > > > ratio of changing the behavior at this point given how long the code
> > > > has been this way.  Another aspect is that it'd expose zombie tasks
> > > > and possibly migrations of them to each controller.
> > >
> > > Sorry for reviving an old discussion
> > Since you reply to my remark, I have to share that I found myself wrong
> > later wrt the emptiness notifications. Moving the task in cgroup_exit
> > doesn't matter if thread group contains other live tasks, the unpopulated
> > notification will be raised when the last task of thread group calls
> > group_exit, i.e. it is similar to SIGHLD.
> >
> > Now to your issue.
> >
> > > but I got a bug report from a customer about Android
> > What kernel version is that? Can be it considered equal to the current
> > Linux?
>
> I reproduced this with 4.19 kernel that had Tejun's patches backported
> but I believe the same behavior will happen with any kernel that
> contains c03cd7738a83 ("cgroup: Include dying leaders with live
> threads in PROCS iterations") including the latest. I'll try to
> confirm this on 5.4 next week.
>
> >
> > > In my testing I was able to reproduce the failure in
> > > which case rmdir() fails with EBUSY from cgroup_destroy_locked()
> > > because cgroup_is_populated() returns true.
> > That'd mean that not all tasks in the cgroup did exit() (cgroup_exit()),
> > i.e. they're still running. Can you see them in cgroup.threads/tasks?
>
> Correct, I confirmed that there are still tasks in cset->tasks list
> when this happens. Actually when I reproduced this I logged one task
> in cset->dying_tasks which was the group leader and another one in
> cset->tasks. So the last task to reach cgroup_exit() was not the group
> leader.
> However these tasks will be ignored when we enumerate them using
> css_task_iter_start()/css_task_iter_next() loop because
> css_task_iter_advance() ignores tasks with PF_EXITING flag set or
> group leaders with task->signal->live==0 (see the code lines I posted
> in my first email). So threads are there, they are PF_EXITING but at
> least one of them did not reach cgroup_exit() yet. When user space
> reads cgroup.procs/tasks they are skipped. This is the change in
> behavior which is different from what it used to be before this patch.
> Prior to it these tasks would be visible even if they are exiting.
>
> >
> > > cgroup_is_populated() depends on cgrp->nr_populated_xxx counters which
> > > IIUC will not be updated until cgroup_update_populated() is called
> > > from cgroup_exit() and that might get delayed...
> > Why do you think it's delayed?
>
> I think exit_mm() can be one of the contributors if the process owns a
> considerable amount of memory but there might be other possible
> reasons.
>
> > > So from user space's point of view the cgroup is empty and can be
> > > removed but rmdir() fails to do so.
> > As Tejun says, cgroup with only dead _tasks_ should be removable (and if
> > I'm not mistaken it is in the current kernel).

Sorry, I missed this last part.
Agree, if all tasks are dead the cgroup should be removable and it
will be if all tasks reached cgroup_exit(). The issue happens when
tasks are marked PF_EXITING, userspace reads empty cgroups.procs and
issues rmdir() before the last killed task reaches cgroup_exit().

> > Unless you do individual
> > threads migration when a thread would be separated from its (dead)
> > leader. Does your case include such migrations?

No my case just kills all tasks in the cgroup until cgroup.procs is
empty and then tries to remove the cgroup itself. No migrations.

> >
> > Michal
>
> Thanks,
> Suren.