On Mon, Jul 13, 2020 at 8:45 PM Michal Hocko <mhocko@xxxxxxxxxx> wrote:
>
> On Mon 13-07-20 20:24:07, Yafang Shao wrote:
> > On Mon, Jul 13, 2020 at 2:21 PM Michal Hocko <mhocko@xxxxxxxxxx> wrote:
> > >
> > > On Mon 13-07-20 08:01:57, Michal Hocko wrote:
> > > > On Fri 10-07-20 23:18:01, Yafang Shao wrote:
> > > [...]
> > > > > There're many threads of a multi-threaded task running in parallel in a
> > > > > container on many CPUs. Then many threads triggered OOM at the same time:
> > > > >
> > > > > CPU-1          CPU-2          ...    CPU-n
> > > > > thread-1       thread-2       ...    thread-n
> > > > >
> > > > > wait oom_lock  wait oom_lock  ...    hold oom_lock
> > > > >
> > > > >                                      (sigkill received)
> > > > >
> > > > >                                      select current as victim
> > > > >                                      and wakeup oom reaper
> > > > >
> > > > >                                      release oom_lock
> > > > >
> > > > > (MMF_OOM_SKIP set by oom reaper)
> > > > >
> > > > > (lots of pages are freed)
> > > > > hold oom_lock
> > > >
> > > > Could you be more specific please? The page allocator never waits for
> > > > the oom_lock and keeps retrying instead. Also __alloc_pages_may_oom
> > > > tries to allocate with the lock held.
> > >
> > > I suspect that you are looking at memcg oom killer.
> >
> > Right, these threads were waiting for the oom_lock in
> > mem_cgroup_out_of_memory().
> >
> > > Because we do not do
> > > trylock there for some reason I do not immediately remember from the
> > > top of my head. If this is really the case then I would recommend
> > > looking into how the page allocator implements this and follow the
> > > same pattern for memcg as well.
> >
> > That is a good suggestion.
> > But we can't try locking the global oom_lock here, because a task OOMing
> > in memcg foo can't help the tasks in memcg bar.
>
> I do not follow. oom_lock is not about fwd progress. It is a big lock to
> synchronize against oom_disable logic.
>
> I have this in mind
>
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 248e6cad0095..29d1f8c2d968 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -1563,8 +1563,10 @@ static bool mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask,
>  	};
>  	bool ret;
>
> -	if (mutex_lock_killable(&oom_lock))
> +	if (!mutex_trylock(&oom_lock))
>  		return true;

             root_mem_cgroup
              /           \
     memcg_a (16G)      memcg_b (32G)
          |                  |
     process a_1         process b_1
     (reach memcg_a      (reach memcg_b
      limit)              limit)
     hold oom_lock       wait oom_lock

So we can see that process a_1 will try to kill a process in memcg_a, while
process b_1 needs to kill a process in memcg_b. IOW, the process killed in
memcg_a can't help the processes in memcg_b, so process b_1 should not
trylock the oom_lock here.

While if the memcg tree is,

        target mem_cgroup (16G)
           /             \
          |               |
     process a_1      process a_2
     (reach limit)    (reach limit)
     hold oom_lock    wait oom_lock

then process a_2 can trylock the oom_lock here. IOW, these processes should
be in the same memcg. That's why I said that we should introduce a per-memcg
oom_lock.

> +
> +
>  	/*
>  	 * A few threads which were not waiting at mutex_lock_killable() can
>  	 * fail to bail out. Therefore, check again after holding oom_lock.
>
> But as I've said I would need to double check the history on why we
> differ here. Btw. I suspect that the mem_cgroup_out_of_memory call in
> mem_cgroup_oom_synchronize is bogus and can no longer trigger after
> 29ef680ae7c21 but this needs double checking as well.
>
> > IOW, we need to introduce the per memcg oom_lock, like below,
>
> I do not see why. Besides that we already do have a per oom memcg
> hierarchy lock.

--
Thanks
Yafang