Re: [PATCH 3/3 cgroup/for-5.2-fixes] cgroup: Include dying leaders with live threads in PROCS iterations

Suren Baghdasaryan <surenb@xxxxxxxxxx> · Wed, 15 Jan 2020 20:26:09 -0800

On Fri, Jan 10, 2020 at 4:55 PM Suren Baghdasaryan <surenb@xxxxxxxxxx> wrote:
>
> On Fri, Jan 10, 2020 at 4:00 PM Suren Baghdasaryan <surenb@xxxxxxxxxx> wrote:
> >
> >  H Michal,
> >
> > On Fri, Jan 10, 2020 at 3:16 PM Michal Koutný <mkoutny@xxxxxxxx> wrote:
> > >
> > > Hello Suren.
> > >
> > > On Fri, Jan 10, 2020 at 01:47:19PM -0800, Suren Baghdasaryan <surenb@xxxxxxxxxx> wrote:
> > > > > On Fri, Jun 07, 2019 at 07:09:53PM +0200, Michal Koutný wrote:
> > > > > > Wouldn't it make more sense to call
> > > > > >       css_set_move_task(tsk, cset, NULL, false);
> > > > > > in cgroup_release instead of cgroup_exit then?
> > > > > >
> > > > > > css_set_move_task triggers the cgroup emptiness notification so if we
> > > > > > list group leaders with running siblings as members of the cgroup (IMO
> > > > > > correct), is it consistent to deliver the (un)populated event
> > > > > > that early?
> > > > > > By moving to cgroup_release we would also make this notification
> > > > > > analogous to SIGCHLD delivery.
> > > > >
> > > > > So, the current behavior is mostly historical and I don't think this
> > > > > is a better behavior.  That said, I'm not sure about the cost benefit
> > > > > ratio of changing the behavior at this point given how long the code
> > > > > has been this way.  Another aspect is that it'd expose zombie tasks
> > > > > and possibly migrations of them to each controller.
> > > >
> > > > Sorry for reviving an old discussion
> > > Since you reply to my remark, I have to share that I found myself wrong
> > > later wrt the emptiness notifications. Moving the task in cgroup_exit
> > > doesn't matter if thread group contains other live tasks, the unpopulated
> > > notification will be raised when the last task of thread group calls
> > > group_exit, i.e. it is similar to SIGHLD.
> > >
> > > Now to your issue.
> > >
> > > > but I got a bug report from a customer about Android
> > > What kernel version is that? Can be it considered equal to the current
> > > Linux?
> >
> > I reproduced this with 4.19 kernel that had Tejun's patches backported
> > but I believe the same behavior will happen with any kernel that
> > contains c03cd7738a83 ("cgroup: Include dying leaders with live
> > threads in PROCS iterations") including the latest. I'll try to
> > confirm this on 5.4 next week.

Hi Folks,
I confirmed that the issue is happening on 5.4 as well. I did
implement a fix for this and a kselftest to reproduce the problem.
Will send both the fix and the kselftest as a 2-patch series shortly.
Thanks,
Suren.

> >
> > >
> > > > In my testing I was able to reproduce the failure in
> > > > which case rmdir() fails with EBUSY from cgroup_destroy_locked()
> > > > because cgroup_is_populated() returns true.
> > > That'd mean that not all tasks in the cgroup did exit() (cgroup_exit()),
> > > i.e. they're still running. Can you see them in cgroup.threads/tasks?
> >
> > Correct, I confirmed that there are still tasks in cset->tasks list
> > when this happens. Actually when I reproduced this I logged one task
> > in cset->dying_tasks which was the group leader and another one in
> > cset->tasks. So the last task to reach cgroup_exit() was not the group
> > leader.
> > However these tasks will be ignored when we enumerate them using
> > css_task_iter_start()/css_task_iter_next() loop because
> > css_task_iter_advance() ignores tasks with PF_EXITING flag set or
> > group leaders with task->signal->live==0 (see the code lines I posted
> > in my first email). So threads are there, they are PF_EXITING but at
> > least one of them did not reach cgroup_exit() yet. When user space
> > reads cgroup.procs/tasks they are skipped. This is the change in
> > behavior which is different from what it used to be before this patch.
> > Prior to it these tasks would be visible even if they are exiting.
> >
> > >
> > > > cgroup_is_populated() depends on cgrp->nr_populated_xxx counters which
> > > > IIUC will not be updated until cgroup_update_populated() is called
> > > > from cgroup_exit() and that might get delayed...
> > > Why do you think it's delayed?
> >
> > I think exit_mm() can be one of the contributors if the process owns a
> > considerable amount of memory but there might be other possible
> > reasons.
> >
> > > > So from user space's point of view the cgroup is empty and can be
> > > > removed but rmdir() fails to do so.
> > > As Tejun says, cgroup with only dead _tasks_ should be removable (and if
> > > I'm not mistaken it is in the current kernel).
>
> Sorry, I missed this last part.
> Agree, if all tasks are dead the cgroup should be removable and it
> will be if all tasks reached cgroup_exit(). The issue happens when
> tasks are marked PF_EXITING, userspace reads empty cgroups.procs and
> issues rmdir() before the last killed task reaches cgroup_exit().
>
> > > Unless you do individual
> > > threads migration when a thread would be separated from its (dead)
> > > leader. Does your case include such migrations?
>
> No my case just kills all tasks in the cgroup until cgroup.procs is
> empty and then tries to remove the cgroup itself. No migrations.
>
> > >
> > > Michal
> >
> > Thanks,
> > Suren.