On Wed, Oct 25, 2023 at 6:30 AM Michal Koutný <mkoutny@xxxxxxxx> wrote:
>
> Hi.
>
> On Tue, Oct 24, 2023 at 04:10:32PM -0700, "T.J. Mercier" <tjmercier@xxxxxxxxxx> wrote:
> > Back on this and pretty sure I discovered what's happening. For
> > processes with multiple threads where each thread has reached
> > atomic_dec_and_test(&tsk->signal->live) in do_exit (but not all have
> > reached cgroup_exit yet), subsequent reads of cgroup.procs will skip
> > over the process with not-yet-fully-exited thread group members
> > because the read of task->signal->live evaluates to 0 here in
> > css_task_iter_advance:
> > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/kernel/cgroup/cgroup.c?h=v6.5#n4869
>
> Well done! It sounds plausible, the task->signal->live is not synced
> via css_set_lock.
>
> > But the cgroup is not removable yet because cgroup_exit hasn't been
> > called for all tasks.
> >
> > Since all tasks have been signaled in this case and we're just waiting
> > for the exits to complete, I think it should be possible to turn the
> > cgroup into a zombie on rmdir with the current behavior of
> > cgroup.procs.
>
> In this case it could be removed but it would make the check in
> cgroup_destroy_locked() way too complicated (if I understand your idea).
>

I was thinking we could remove it from sysfs and prevent migrations / forks,
but keep the cgroup and csets around in a dead state until all tasks
complete their exits. This is similar to how, with memcg, any pages still
charged to the memcg keep it alive in the background even after a
successful rmdir. Except in this case we're guaranteed to make progress
toward releasing the cgroup (without any dependency on reclaim or charge
transfer), because all tasks have already started exiting; we're just
waiting for the scheduler to give them enough time to finish. The
cgroup_is_populated check in cgroup_destroy_locked is what currently
blocks the removal, and when nr_populated_csets is not 0 I think we'd
need to iterate through all the csets and ensure that each task has been
signaled with SIGKILL. Or just ensure there are only dying tasks and that
the thread group leader's task->signal->live is 0, since that's when
cgroup.procs stops showing the process?

> > Or if we change cgroup.procs to continue showing the thread group
> > leader until all threads have finished exiting, we'd still probably
> > have to change our userspace to accommodate the longer kill times
> > exceeding our timeouts.
>
> Provided this is the cause, you could get this more (timewise) precise
> info from cgroup.threads already? (PR [1] has a reproducer and its fix
> describes exactly opposite listings (confusing) but I think that fix
> actually works because it checks cgroup.threads additionally.)
>

Yes, I just tried this out, and if we check both cgroup.procs and
cgroup.threads then we wait long enough to be sure that the rmdir
succeeds. Unfortunately that duration exceeds our current timeouts in
the environment where I can reproduce this, so we eventually give up
waiting and don't actually attempt the rmdir. But I'll fix that with the
change to use the populated notification.
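
For what it's worth, here is roughly the userspace change I have in mind.
It's only a sketch (the function names are made up, and it assumes cgroup
v2 with the caller passing in the cgroup's path): instead of re-reading
cgroup.procs and cgroup.threads until a timeout, block on the "populated"
notification from cgroup.events, which the kernel raises as POLLPRI on
that file whenever the populated state changes, and only rmdir once the
file reads "populated 0".

/*
 * Sketch only: wait for the cgroup at cg_path to become unpopulated by
 * watching cgroup.events, then rmdir it.  Assumes cgroup v2.
 */
#include <fcntl.h>
#include <poll.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

static int cgroup_populated(int fd)
{
        char buf[256];
        ssize_t n = pread(fd, buf, sizeof(buf) - 1, 0);

        if (n <= 0)
                return -1;
        buf[n] = '\0';
        return strstr(buf, "populated 1") != NULL;
}

int wait_empty_and_rmdir(const char *cg_path)
{
        char path[4096];
        struct pollfd pfd = { .events = POLLPRI };
        int populated;

        snprintf(path, sizeof(path), "%s/cgroup.events", cg_path);
        pfd.fd = open(path, O_RDONLY);
        if (pfd.fd < 0)
                return -1;

        /*
         * Re-read after every notification; reading the file also rearms
         * the poll, so the loop blocks until the next state change.
         */
        while ((populated = cgroup_populated(pfd.fd)) == 1) {
                if (poll(&pfd, 1, -1) < 0)
                        break;
        }
        close(pfd.fd);

        return populated == 0 ? rmdir(cg_path) : -1;
}

If I'm reading cgroup_update_populated right, populated only drops to 0
once cgroup_exit() has run for the last task in the cgroup, so by the
time the notification fires the rmdir shouldn't race with dying threads
the way the cgroup.procs check does.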

> > So I'm going to change our userspace anyway as suggested by Tejun. But
> > I'd be interested to hear what folks think about the potential kernel
> > solutions as well.
>
> Despite that, I'd stick with the notifications since they rely on
> proper synchronization of cgroup-info.
>
> HTH,
> Michal
>
> [1] https://github.com/systemd/systemd/pull/23561

Interesting case, and in the same part of the code. If one of the exit
functions takes a long time in the leader I could see how this might
happen, but I think a lot of those (the mm, for example) should be shared
among the group members, so I'm not sure exactly what the cause would be.
Maybe it's just that all the threads get scheduled before the leader.

Thanks!
-T.J.