On Mon 13-07-20 20:24:07, Yafang Shao wrote:
> On Mon, Jul 13, 2020 at 2:21 PM Michal Hocko <mhocko@xxxxxxxxxx> wrote:
> >
> > On Mon 13-07-20 08:01:57, Michal Hocko wrote:
> > > On Fri 10-07-20 23:18:01, Yafang Shao wrote:
> > [...]
> > > > There are many threads of a multi-threaded task running in parallel in a
> > > > container on many CPUs. Then many threads triggered OOM at the same time:
> > > >
> > > > CPU-1           CPU-2           ...     CPU-n
> > > > thread-1        thread-2        ...     thread-n
> > > >
> > > > wait oom_lock   wait oom_lock   ...     hold oom_lock
> > > >
> > > >                                         (sigkill received)
> > > >
> > > >                                         select current as victim
> > > >                                         and wakeup oom reaper
> > > >
> > > >                                         release oom_lock
> > > >
> > > >                                         (MMF_OOM_SKIP set by oom reaper)
> > > >
> > > >                                         (lots of pages are freed)
> > > > hold oom_lock
> > >
> > > Could you be more specific please? The page allocator never waits for
> > > the oom_lock and keeps retrying instead. Also __alloc_pages_may_oom
> > > tries to allocate with the lock held.
> >
> > I suspect that you are looking at the memcg oom killer.
>
> Right, these threads were waiting on the oom_lock in mem_cgroup_out_of_memory().
>
> > Because we do not do
> > trylock there for some reason I do not immediately remember off the top of
> > my head. If this is really the case then I would recommend looking into
> > how the page allocator implements this and follow the same pattern for
> > memcg as well.
> >
> That is a good suggestion.
> But we can't try locking the global oom_lock here, because a task ooming
> in memcg foo may not help the tasks in memcg bar.

I do not follow. oom_lock is not about forward progress. It is a big lock
to synchronize against the oom_disable logic.

I have this in mind:

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 248e6cad0095..29d1f8c2d968 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1563,8 +1563,10 @@ static bool mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask,
 	};
 	bool ret;
 
-	if (mutex_lock_killable(&oom_lock))
+	if (!mutex_trylock(&oom_lock))
 		return true;
+
+	/*
 	 * A few threads which were not waiting at mutex_lock_killable() can
 	 * fail to bail out. Therefore, check again after holding oom_lock.

But as I've said I would need to double check the history on why we differ
here. Btw. I suspect that the mem_cgroup_out_of_memory call in
mem_cgroup_oom_synchronize is bogus and can no longer trigger after
29ef680ae7c21, but this needs double checking as well.

> IOW, we need to introduce a per-memcg oom_lock, like below,

I do not see why. Besides that, we already have the per-oom memcg
hierarchy lock.

--
Michal Hocko
SUSE Labs
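
For reference, the page-allocator pattern pointed to above lives in
__alloc_pages_may_oom() in mm/page_alloc.c: instead of sleeping on
oom_lock, it trylocks, and on failure reports progress so the caller's
retry loop attempts the allocation again while whoever holds the lock
kills a victim. A condensed sketch of that path (paraphrased from kernels
of roughly that era, not a verbatim quote; the GFP/constrained-alloc
checks and the actual out_of_memory() invocation are elided):

	static inline struct page *
	__alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order,
			      const struct alloc_context *ac,
			      unsigned long *did_some_progress)
	{
		struct page *page = NULL;

		*did_some_progress = 0;

		/*
		 * If somebody else holds oom_lock, they are already making
		 * progress for us: report progress, back off briefly and let
		 * the allocation retry loop run again instead of blocking.
		 */
		if (!mutex_trylock(&oom_lock)) {
			*did_some_progress = 1;
			schedule_timeout_uninterruptible(1);
			return NULL;
		}

		/* ... retry the allocation, else call out_of_memory() ... */

		mutex_unlock(&oom_lock);
		return page;
	}

The memcg diff quoted above follows the same idea: if
mem_cgroup_out_of_memory() cannot take oom_lock, some other task is
already handling the OOM, so returning true effectively treats that as
progress and lets the charge path retry rather than queueing up more
killers on the lock.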