On Wed, Oct 11, 2023 at 4:57 PM T.J. Mercier <tjmercier@xxxxxxxxxx> wrote:
>
> On Tue, Oct 10, 2023 at 10:14 AM T.J. Mercier <tjmercier@xxxxxxxxxx> wrote:
> >
> > On Tue, Oct 10, 2023 at 9:31 AM Michal Koutný <mkoutny@xxxxxxxx> wrote:
> > >
> > > On Fri, Oct 06, 2023 at 11:37:19AM -0700, "T.J. Mercier" <tjmercier@xxxxxxxxxx> wrote:
> > > > I suppose it's also possible there is PID reuse by the same app,
> > > > causing the cgroup to become repopulated at the same time as a kill,
> > > > but that seems extremely unlikely. Plus, at the point where these
> > > > kills are occurring we shouldn't normally be simultaneously launching
> > > > new processes for the app. Similarly if a process forks right before
> > > > it is killed, maybe it doesn't show up in cgroup.procs until after
> > > > we've observed it to be empty?
> > >
> > > Something like this:
> > >
> > >                                                 kill (before)
> > >   cgroup_fork
> > >   cgroup_can_fork        .. begin(threadgroup_rwsem)
> > >     tasklist_lock
> > >       fatal_signal_pending -> cgroup_cancel_fork    kill (mid)
> > >     tasklist_unlock
> > >                                                 seq_start,
> > >                                                 seq_next...
> > >
> > >   cgroup_post_fork       .. end(threadgroup_rwsem)
> > >                                                 kill (after)
> > >
> > > Only the third option `kill (after)` means the child would end up on the
> > > css_set list. But that would mean the reader squeezed before
> > > cgroup_post_fork() would still see the parent.
> > > (I.e. I don't see the kill/fork race could skew the listed procs.)
> >
> > So here is a trace from a phone where the kills happen (~100ms) after
> > the forks. All but one of the children die before we read cgroup.procs
> > for the first time, and cgroup.procs is not empty. 5ms later we read
> > again and cgroup.procs is empty, but the last child still hasn't
> > exited. So it makes sense that the cset from that last child is still
> > on the list.
> > https://pastebin.com/raw/tnHhnZBE
> >
> Collected a bit more info. It's before exit_mm that the process
> disappears from cgroup.procs, but the delay to populated=0 seems to be
> exacerbated by CPU contention during this time. What's weird is that I
> can see the task blocking the rmdir on cgrp->cset_links->cset->tasks
> inside of cgroup_destroy_locked when the rmdir is attempted, so I
> don't understand why it doesn't show up when iterating tasks for
> cgroup.procs.
>
> I'm going to be out until next Wednesday when I'll look some more.

Back on this and pretty sure I discovered what's happening. For
processes with multiple threads where each thread has reached
atomic_dec_and_test(&tsk->signal->live) in do_exit (but not all have
reached cgroup_exit yet), subsequent reads of cgroup.procs will skip
over the process with not-yet-fully-exited thread group members
because the read of task->signal->live evaluates to 0 here in
css_task_iter_advance:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/kernel/cgroup/cgroup.c?h=v6.5#n4869

But the cgroup is not removable yet because cgroup_exit hasn't been
called for all tasks. Since all tasks have been signaled in this case
and we're just waiting for the exits to complete, I think it should be
possible to turn the cgroup into a zombie on rmdir with the current
behavior of cgroup.procs. Or if we change cgroup.procs to continue
showing the thread group leader until all threads have finished
exiting, we'd still probably have to change our userspace to
accommodate the longer kill times exceeding our timeouts. So I'm going
to change our userspace anyway as suggested by Tejun.
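
In case it's useful, below is a rough userspace sketch (untested; the
cgroup path, mount point, and thread count are arbitrary assumptions on
my part) of how one might try to observe the window on a test system
with cgroup2 mounted at /sys/fs/cgroup: it SIGKILLs a multithreaded
child and polls for a moment where cgroup.procs reads back empty while
cgroup.events still says "populated 1" and rmdir() still fails.

/*
 * Rough sketch of a reproducer (untested; the cgroup path and thread
 * count are arbitrary). Run as root with cgroup2 mounted at
 * /sys/fs/cgroup. Build with: gcc -pthread.
 */
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <pthread.h>
#include <signal.h>
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#define CG "/sys/fs/cgroup/procs-test"	/* assumed test cgroup */
#define NTHREADS 32			/* arbitrary */

static void *spin(void *arg)
{
	(void)arg;
	for (;;)
		pause();
	return NULL;
}

/* Read a whole file into buf, return the length read or -1. */
static int read_file(const char *path, char *buf, size_t len)
{
	int fd = open(path, O_RDONLY);
	ssize_t n;

	if (fd < 0)
		return -1;
	n = read(fd, buf, len - 1);
	close(fd);
	if (n < 0)
		return -1;
	buf[n] = '\0';
	return n;
}

int main(void)
{
	char procs[4096], events[256];
	pid_t pid;
	int i;

	mkdir(CG, 0755);

	pid = fork();
	if (pid == 0) {
		/* Child: enter the test cgroup, then spawn spinning threads. */
		pthread_t t;
		int fd = open(CG "/cgroup.procs", O_WRONLY);

		if (fd >= 0) {
			dprintf(fd, "%d\n", (int)getpid());
			close(fd);
		}
		for (i = 0; i < NTHREADS; i++)
			pthread_create(&t, NULL, spin, NULL);
		pause();
		_exit(0);
	}

	sleep(1);		/* let the child settle in the cgroup */
	kill(pid, SIGKILL);

	/*
	 * While the threads are exiting, look for the window where the
	 * group leader is no longer listed in cgroup.procs but the cgroup
	 * is still populated and therefore not removable.
	 */
	for (i = 0; i < 100000; i++) {
		int len = read_file(CG "/cgroup.procs", procs, sizeof(procs));

		read_file(CG "/cgroup.events", events, sizeof(events));
		if (len == 0 && strstr(events, "populated 1")) {
			printf("procs empty, still populated, rmdir: %s\n",
			       rmdir(CG) ? strerror(errno) : "ok");
			break;
		}
	}

	waitpid(pid, NULL, 0);
	rmdir(CG);
	return 0;
}

The window is narrow on an idle machine, so it may take several runs or
some background load to catch it, which would line up with the CPU
contention observation above.
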
But I'd be interested to hear what folks think about the potential
kernel solutions as well.

Thanks,
T.J.