Li Zefan <lizefan@xxxxxxxxxx> writes:

>> I am also seeing what looks like a leak somewhere in the cgroup code as
>> well.  After some runs of the same reproducer I get into a state where,
>> after everything is cleaned up (all of the control groups have been
>> removed and the cgroup filesystem is unmounted), I can mount a cgroup
>> filesystem with that same combination of subsystems, but I can't mount
>> a cgroup filesystem with any of those subsystems in any other
>> combination.  So I am guessing that the superblock from the original
>> mount is still lingering for some reason.
>>
>
> If this happens again, you can check /proc/cgroups,
>
> #subsys_name    hierarchy       num_cgroups     enabled
> cpuset          0               1               1
> debug           0               1               1
> cpu             0               1               1
> cpuacct         0               1               1
> memory          0               1               1
> devices         0               1               1
> freezer         0               1               1
> blkio           0               1               1
>
> If "hierarchy" is not 0, then it wasn't really unmounted.  If "num_cgroups"
> is not 1, then there are some cgroups not really destroyed though they've
> been rmdired.

Interesting.  It looks like at some point I had some cpu and cpuacct
hierarchies that never really unmounted.

#subsys_name    hierarchy       num_cgroups     enabled
cpuset          0               1               1
cpu             89              1               1
cpuacct         89              1               1
memory          0               1               1
devices         0               1               1
freezer         0               1               1
net_cls         0               1               1
blkio           0               1               1
perf_event      0               1               1
hugetlb         0               1               1

And playing a little more I get the leak scenario.

#subsys_name    hierarchy       num_cgroups     enabled
cpuset          0               1               1
cpu             90              3               1
cpuacct         90              3               1
memory          90              3               1
devices         0               1               1
freezer         90              3               1
net_cls         0               1               1
blkio           0               1               1
perf_event      0               1               1
hugetlb         0               1               1

So it definitely did not unmount.

After echo 3 > /proc/sys/vm/drop_caches

#subsys_name    hierarchy       num_cgroups     enabled
cpuset          0               1               1
cpu             90              1               1
cpuacct         90              1               1
memory          90              1               1
devices         0               1               1
freezer         90              1               1
net_cls         0               1               1
blkio           0               1               1
perf_event      0               1               1
hugetlb         0               1               1

Hmm.

But after some time passes I have

#subsys_name    hierarchy       num_cgroups     enabled
cpuset          0               1               1
cpu             0               1               1
cpuacct         0               1               1
memory          0               1               1
devices         0               1               1
freezer         0               1               1
net_cls         0               1               1
blkio           0               1               1
perf_event      0               1               1
hugetlb         0               1               1

Hmm.
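
For anyone who wants to script the /proc/cgroups check instead of reading
the table by eye, here is a minimal sketch in C.  It assumes the
four-column layout shown in the tables (name, hierarchy, num_cgroups,
enabled).  struct cgroup_scan and scan_cgroups_text() are names made up
for this example, not an existing API, and the caller is expected to have
read /proc/cgroups into a string already.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical result type: counts of suspicious subsystems. */
struct cgroup_scan {
	int still_mounted;	/* rows with hierarchy != 0 */
	int lingering;		/* rows with num_cgroups != 1 */
};

/*
 * Scan text in /proc/cgroups format.  Assumes each data row is
 * "name hierarchy num_cgroups enabled"; rows that do not parse
 * (including the '#' header) are skipped.
 */
static struct cgroup_scan scan_cgroups_text(const char *text)
{
	struct cgroup_scan r = { 0, 0 };

	assert(text != NULL);
	while (*text) {
		const char *eol = strchr(text, '\n');
		size_t span = eol ? (size_t)(eol - text) : strlen(text);
		char line[256];
		size_t len = span < sizeof(line) ? span : sizeof(line) - 1;
		char name[64];
		int hier, num, enabled;

		/* Copy one row so sscanf cannot run into the next row. */
		memcpy(line, text, len);
		line[len] = '\0';
		text += span + (eol ? 1 : 0);

		if (line[0] == '#')
			continue;
		if (sscanf(line, "%63s %d %d %d",
			   name, &hier, &num, &enabled) != 4)
			continue;
		if (hier != 0)
			r.still_mounted++;	/* hierarchy not unmounted */
		if (num != 1)
			r.lingering++;		/* cgroups not destroyed */
	}
	return r;
}
```

Fed the "leak scenario" table, this counts four subsystems still attached
to a hierarchy and four with lingering cgroups; after drop_caches the
second count drops to zero while the first stays at four.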

Looking further I see what is going on, and it has nothing to do with
the freezer.  (I commented out that code and reproduced the hang without
the freezer to be doubly certain.)

On the exit path exit_robust_list is triggering a page fault to fault a
page back in.  Since we have no memory, that causes the exit path to get
stuck in mem_cgroup_handle_oom.

Which means the following change should fix the hang.  I will test it in
just a second.

The problem is that we only handled pending fatal signals and exiting
processes when the OOM logic was enabled.  Sigh.

Eric

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 00a7a66..5998a57 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1792,16 +1792,6 @@ static void mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask,
 	unsigned int points = 0;
 	struct task_struct *chosen = NULL;
 
-	/*
-	 * If current has a pending SIGKILL or is exiting, then automatically
-	 * select it.  The goal is to allow it to allocate so that it may
-	 * quickly exit and free its memory.
-	 */
-	if (fatal_signal_pending(current) || current->flags & PF_EXITING) {
-		set_thread_flag(TIF_MEMDIE);
-		return;
-	}
-
 	check_panic_on_oom(CONSTRAINT_MEMCG, gfp_mask, order, NULL);
 	totalpages = mem_cgroup_get_limit(memcg) >> PAGE_SHIFT ? : 1;
 	for_each_mem_cgroup_tree(iter, memcg) {
@@ -2220,7 +2210,15 @@ static bool mem_cgroup_handle_oom(struct mem_cgroup *memcg, gfp_t mask,
 		mem_cgroup_oom_notify(memcg);
 	spin_unlock(&memcg_oom_lock);
 
-	if (need_to_kill) {
+	/*
+	 * If current has a pending SIGKILL or is exiting, then automatically
+	 * select it.  The goal is to allow it to allocate so that it may
+	 * quickly exit and free its memory.
+	 */
+	if (fatal_signal_pending(current) || current->flags & PF_EXITING) {
+		set_thread_flag(TIF_MEMDIE);
+		finish_wait(&memcg_oom_waitq, &owait.wait);
+	} else if (need_to_kill) {
 		finish_wait(&memcg_oom_waitq, &owait.wait);
 		mem_cgroup_out_of_memory(memcg, mask, order);
 	} else {
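
To spell out the ordering issue, here is a small user-space model of the
decision before and after the change.  It is only a sketch of the control
flow, not kernel code: the enum and both function names are made up for
illustration, "dying" stands in for fatal_signal_pending(current) ||
current->flags & PF_EXITING, and the model assumes need_to_kill is false
whenever the kill path is not taken (for example with the OOM killer
disabled for the group).

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical outcomes of the memcg OOM handler, for illustration. */
enum oom_action {
	OOM_SELF_MEMDIE,	/* current gets TIF_MEMDIE and may allocate */
	OOM_KILL_VICTIM,	/* pick a victim in the group */
	OOM_SLEEP,		/* wait on the OOM waitqueue */
};

/* Old ordering: the dying check is only reached on the kill path. */
static enum oom_action handle_oom_before(bool dying, bool need_to_kill)
{
	if (need_to_kill)
		return dying ? OOM_SELF_MEMDIE : OOM_KILL_VICTIM;
	return OOM_SLEEP;
}

/* New ordering: the dying check comes first, whatever need_to_kill is. */
static enum oom_action handle_oom_after(bool dying, bool need_to_kill)
{
	if (dying)
		return OOM_SELF_MEMDIE;
	if (need_to_kill)
		return OOM_KILL_VICTIM;
	return OOM_SLEEP;
}

/* The one case that changes: a dying task when nothing will be killed. */
static void oom_model_self_check(void)
{
	assert(handle_oom_before(true, false) == OOM_SLEEP);
	assert(handle_oom_after(true, false) == OOM_SELF_MEMDIE);
}
```

The other three input combinations give the same result in both versions,
which is why the hang only shows up for an exiting task that faults while
the kill path is not taken.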