On Mon, Jul 13, 2020 at 2:21 PM Michal Hocko <mhocko@xxxxxxxxxx> wrote:
>
> On Mon 13-07-20 08:01:57, Michal Hocko wrote:
> > On Fri 10-07-20 23:18:01, Yafang Shao wrote:
> [...]
> > > Many threads of a multi-threaded task are running in parallel in a
> > > container on many CPUs. Then many threads trigger OOM at the same
> > > time:
> > >
> > > CPU-1           CPU-2           ...   CPU-n
> > > thread-1        thread-2        ...   thread-n
> > >
> > > wait oom_lock   wait oom_lock   ...   hold oom_lock
> > >
> > >                                       (sigkill received)
> > >
> > >                                       select current as victim
> > >                                       and wakeup oom reaper
> > >
> > >                                       release oom_lock
> > >
> > >                     (MMF_OOM_SKIP set by oom reaper)
> > >
> > >                     (lots of pages are freed)
> > > hold oom_lock
> >
> > Could you be more specific please? The page allocator never waits for
> > the oom_lock and keeps retrying instead. Also __alloc_pages_may_oom
> > tries to allocate with the lock held.
>
> I suspect that you are looking at the memcg oom killer.

Right, these threads were waiting on the oom_lock in
mem_cgroup_out_of_memory().

> Because we do not do
> trylock there for some reason I do not immediately remember off the top
> of my head. If this is really the case then I would recommend looking
> into how the page allocator implements this and following the same
> pattern for memcg as well.
>

That is a good suggestion. But we can't simply trylock the global
oom_lock here, because a task OOM-killing in memcg foo may not help the
tasks in memcg bar. IOW, we need to introduce a per-memcg oom_lock, like
below:

mem_cgroup_out_of_memory
+	if (!mutex_trylock(&memcg->lock))
+		return true;

	if (mutex_lock_killable(&oom_lock))
		return true;

And the memcg tree should also be considered.
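Roughly, the direction would look like the following minimal sketch,
mirroring the trylock pattern in __alloc_pages_may_oom(). The "struct
mutex lock" member of struct mem_cgroup is hypothetical here; mainline
only has a bool oom_lock, driven by mem_cgroup_oom_trylock():

static bool mem_cgroup_out_of_memory(struct mem_cgroup *memcg,
				     gfp_t gfp_mask, int order)
{
	bool ret = true;

	/*
	 * Like the page allocator: if another thread in this memcg is
	 * already OOM-killing, do not queue up behind it.  Returning
	 * true pretends progress, so the charge path retries and can
	 * observe the memory freed by the in-flight kill.
	 */
	if (!mutex_trylock(&memcg->lock))	/* hypothetical member */
		return true;

	if (mutex_lock_killable(&oom_lock)) {
		mutex_unlock(&memcg->lock);
		return true;
	}

	/* ... existing victim selection via out_of_memory(&oc) ... */

	mutex_unlock(&oom_lock);
	mutex_unlock(&memcg->lock);
	return ret;
}

For the memcg tree, the trylock would presumably have to cover the whole
subtree, in the spirit of what mem_cgroup_oom_trylock() already does with
its per-memcg flags, so that siblings under a common ooming ancestor do
not kill independently.

-- 
Thanks
Yafang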