On Fri, Jan 10, 2020 at 4:00 PM Suren Baghdasaryan <surenb@xxxxxxxxxx> wrote: > > H Michal, > > On Fri, Jan 10, 2020 at 3:16 PM Michal Koutný <mkoutny@xxxxxxxx> wrote: > > > > Hello Suren. > > > > On Fri, Jan 10, 2020 at 01:47:19PM -0800, Suren Baghdasaryan <surenb@xxxxxxxxxx> wrote: > > > > On Fri, Jun 07, 2019 at 07:09:53PM +0200, Michal Koutný wrote: > > > > > Wouldn't it make more sense to call > > > > > css_set_move_task(tsk, cset, NULL, false); > > > > > in cgroup_release instead of cgroup_exit then? > > > > > > > > > > css_set_move_task triggers the cgroup emptiness notification so if we > > > > > list group leaders with running siblings as members of the cgroup (IMO > > > > > correct), is it consistent to deliver the (un)populated event > > > > > that early? > > > > > By moving to cgroup_release we would also make this notification > > > > > analogous to SIGCHLD delivery. > > > > > > > > So, the current behavior is mostly historical and I don't think this > > > > is a better behavior. That said, I'm not sure about the cost benefit > > > > ratio of changing the behavior at this point given how long the code > > > > has been this way. Another aspect is that it'd expose zombie tasks > > > > and possibly migrations of them to each controller. > > > > > > Sorry for reviving an old discussion > > Since you reply to my remark, I have to share that I found myself wrong > > later wrt the emptiness notifications. Moving the task in cgroup_exit > > doesn't matter if thread group contains other live tasks, the unpopulated > > notification will be raised when the last task of thread group calls > > group_exit, i.e. it is similar to SIGHLD. > > > > Now to your issue. > > > > > but I got a bug report from a customer about Android > > What kernel version is that? Can be it considered equal to the current > > Linux? > > I reproduced this with 4.19 kernel that had Tejun's patches backported > but I believe the same behavior will happen with any kernel that > contains c03cd7738a83 ("cgroup: Include dying leaders with live > threads in PROCS iterations") including the latest. I'll try to > confirm this on 5.4 next week. > > > > > > In my testing I was able to reproduce the failure in > > > which case rmdir() fails with EBUSY from cgroup_destroy_locked() > > > because cgroup_is_populated() returns true. > > That'd mean that not all tasks in the cgroup did exit() (cgroup_exit()), > > i.e. they're still running. Can you see them in cgroup.threads/tasks? > > Correct, I confirmed that there are still tasks in cset->tasks list > when this happens. Actually when I reproduced this I logged one task > in cset->dying_tasks which was the group leader and another one in > cset->tasks. So the last task to reach cgroup_exit() was not the group > leader. > However these tasks will be ignored when we enumerate them using > css_task_iter_start()/css_task_iter_next() loop because > css_task_iter_advance() ignores tasks with PF_EXITING flag set or > group leaders with task->signal->live==0 (see the code lines I posted > in my first email). So threads are there, they are PF_EXITING but at > least one of them did not reach cgroup_exit() yet. When user space > reads cgroup.procs/tasks they are skipped. This is the change in > behavior which is different from what it used to be before this patch. > Prior to it these tasks would be visible even if they are exiting. > > > > > > cgroup_is_populated() depends on cgrp->nr_populated_xxx counters which > > > IIUC will not be updated until cgroup_update_populated() is called > > > from cgroup_exit() and that might get delayed... > > Why do you think it's delayed? > > I think exit_mm() can be one of the contributors if the process owns a > considerable amount of memory but there might be other possible > reasons. > > > > So from user space's point of view the cgroup is empty and can be > > > removed but rmdir() fails to do so. > > As Tejun says, cgroup with only dead _tasks_ should be removable (and if > > I'm not mistaken it is in the current kernel). Sorry, I missed this last part. Agree, if all tasks are dead the cgroup should be removable and it will be if all tasks reached cgroup_exit(). The issue happens when tasks are marked PF_EXITING, userspace reads empty cgroups.procs and issues rmdir() before the last killed task reaches cgroup_exit(). > > Unless you do individual > > threads migration when a thread would be separated from its (dead) > > leader. Does your case include such migrations? No my case just kills all tasks in the cgroup until cgroup.procs is empty and then tries to remove the cgroup itself. No migrations. > > > > Michal > > Thanks, > Suren.