On Wed, Oct 11, 2023 at 4:57 PM T.J. Mercier <tjmercier@xxxxxxxxxx> wrote:
>
> On Tue, Oct 10, 2023 at 10:14 AM T.J. Mercier <tjmercier@xxxxxxxxxx> wrote:
> >
> > On Tue, Oct 10, 2023 at 9:31 AM Michal Koutný <mkoutny@xxxxxxxx> wrote:
> > >
> > > On Fri, Oct 06, 2023 at 11:37:19AM -0700, "T.J. Mercier" <tjmercier@xxxxxxxxxx> wrote:
> > > > I suppose it's also possible there is PID reuse by the same app,
> > > > causing the cgroup to become repopulated at the same time as a kill,
> > > > but that seems extremely unlikely. Plus, at the point where these
> > > > kills are occurring we shouldn't normally be simultaneously launching
> > > > new processes for the app. Similarly if a process forks right before
> > > > it is killed, maybe it doesn't show up in cgroup.procs until after
> > > > we've observed it to be empty?
> > >
> > > Something like this:
> > >
> > >                                                 kill (before)
> > >   cgroup_fork
> > >   cgroup_can_fork        .. begin(threadgroup_rwsem)
> > >     tasklist_lock
> > >       fatal_signal_pending -> cgroup_cancel_fork    kill (mid)
> > >     tasklist_unlock
> > >                                                 seq_start,
> > >                                                 seq_next...
> > >
> > >   cgroup_post_fork       .. end(threadgroup_rwsem)
> > >                                                 kill (after)
> > >
> > > Only the third option `kill (after)` means the child would end up on the
> > > css_set list. But that would mean the reader squeezed before
> > > cgroup_post_fork() would still see the parent.
> > > (I.e. I don't see the kill/fork race could skew the listed procs.)
> >
> > So here is a trace from a phone where the kills happen (~100ms) after
> > the forks. All but one of the children die before we read cgroup.procs
> > for the first time, and cgroup.procs is not empty. 5ms later we read
> > again and cgroup.procs is empty, but the last child still hasn't
> > exited. So it makes sense that the cset from that last child is still
> > on the list.
> > https://pastebin.com/raw/tnHhnZBE
> >
> Collected a bit more info. It's before exit_mm that the process
> disappears from cgroup.procs, but the delay to populated=0 seems to be
> exacerbated by CPU contention during this time. What's weird is that I
> can see the task blocking the rmdir on cgrp->cset_links->cset->tasks
> inside of cgroup_destroy_locked when the rmdir is attempted, so I
> don't understand why it doesn't show up when iterating tasks for
> cgroup.procs.
>
> I'm going to be out until next Wednesday when I'll look some more.

Back on this and pretty sure I discovered what's happening. For
processes with multiple threads where each thread has reached
atomic_dec_and_test(&tsk->signal->live) in do_exit (but not all have
reached cgroup_exit yet), subsequent reads of cgroup.procs will skip
over the process with not-yet-fully-exited thread group members
because the read of task->signal->live evaluates to 0 here in
css_task_iter_advance:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/kernel/cgroup/cgroup.c?h=v6.5#n4869

But the cgroup is not removable yet because cgroup_exit hasn't been
called for all tasks. Since all tasks have been signaled in this case
and we're just waiting for the exits to complete, I think it should be
possible to turn the cgroup into a zombie on rmdir with the current
behavior of cgroup.procs. Or if we change cgroup.procs to continue
showing the thread group leader until all threads have finished
exiting, we'd still probably have to change our userspace to
accommodate the longer kill times exceeding our timeouts. So I'm going
to change our userspace anyway as suggested by Tejun.
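
In case it's useful, below is a rough userspace sketch (untested; the
cgroup path, mount point, and thread count are arbitrary assumptions on
my part) of how one might try to observe the window on a test system
with cgroup2 mounted at /sys/fs/cgroup: it SIGKILLs a multithreaded
child and polls for a moment where cgroup.procs reads back empty while
cgroup.events still says "populated 1" and rmdir() still fails.

/*
 * Rough sketch of a reproducer (untested; the cgroup path and thread
 * count are arbitrary). Run as root with cgroup2 mounted at
 * /sys/fs/cgroup. Build with: gcc -pthread.
 */
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <pthread.h>
#include <signal.h>
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#define CG "/sys/fs/cgroup/procs-test"	/* assumed test cgroup */
#define NTHREADS 32			/* arbitrary */

static void *spin(void *arg)
{
	(void)arg;
	for (;;)
		pause();
	return NULL;
}

/* Read a whole file into buf, return the length read or -1. */
static int read_file(const char *path, char *buf, size_t len)
{
	int fd = open(path, O_RDONLY);
	ssize_t n;

	if (fd < 0)
		return -1;
	n = read(fd, buf, len - 1);
	close(fd);
	if (n < 0)
		return -1;
	buf[n] = '\0';
	return n;
}

int main(void)
{
	char procs[4096], events[256];
	pid_t pid;
	int i;

	mkdir(CG, 0755);

	pid = fork();
	if (pid == 0) {
		/* Child: enter the test cgroup, then spawn spinning threads. */
		pthread_t t;
		int fd = open(CG "/cgroup.procs", O_WRONLY);

		if (fd >= 0) {
			dprintf(fd, "%d\n", (int)getpid());
			close(fd);
		}
		for (i = 0; i < NTHREADS; i++)
			pthread_create(&t, NULL, spin, NULL);
		pause();
		_exit(0);
	}

	sleep(1);		/* let the child settle in the cgroup */
	kill(pid, SIGKILL);

	/*
	 * While the threads are exiting, look for the window where the
	 * group leader is no longer listed in cgroup.procs but the cgroup
	 * is still populated and therefore not removable.
	 */
	for (i = 0; i < 100000; i++) {
		int len = read_file(CG "/cgroup.procs", procs, sizeof(procs));

		read_file(CG "/cgroup.events", events, sizeof(events));
		if (len == 0 && strstr(events, "populated 1")) {
			printf("procs empty, still populated, rmdir: %s\n",
			       rmdir(CG) ? strerror(errno) : "ok");
			break;
		}
	}

	waitpid(pid, NULL, 0);
	rmdir(CG);
	return 0;
}

The window is narrow on an idle machine, so it may take several runs or
some background load to catch it, which would line up with the CPU
contention observation above.
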
But I'd be interested to hear what folks think about the potential
kernel solutions as well.

Thanks,
T.J.