On Wed 18-03-20 15:03:52, David Rientjes wrote:
> When a process is oom killed as a result of memcg limits and the victim
> is waiting to exit, nothing ends up actually yielding the processor back
> to the victim on UP systems with preemption disabled.  Instead, the
> charging process simply loops in memcg reclaim and eventually soft
> lockups.
>
> For example, on an UP system with a memcg limited to 100MB, if three
> processes each charge 40MB of heap with swap disabled, one of the charging
> processes can loop endlessly trying to charge memory which starves the oom
> victim.

This only happens if there is no reclaimable memory in the hierarchy.
That is a very specific condition. I do not see any other way than
having a misconfigured system with min protection preventing any
reclaim. Otherwise we have cond_resched both in the slab shrinking code
(do_shrink_slab) and in the LRU shrinking code (shrink_lruvec). If I am
wrong and those are insufficient then please be explicit about the
scenario. This is very important information to have in the changelog!

[...]
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -1576,6 +1576,12 @@ static bool mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask,
>  	 */
>  	ret = should_force_charge() || out_of_memory(&oc);
>  	mutex_unlock(&oom_lock);
> +	/*
> +	 * Give a killed process a good chance to exit before trying to
> +	 * charge memory again.
> +	 */
> +	if (ret)
> +		schedule_timeout_killable(1);

Why are you making this conditional? Say that there is no victim to
kill. The charge path would simply bail out and it would really depend
on the call chain whether there is a scheduling point or not. Isn't it
simply safer to call schedule_timeout_killable unconditionally at this
stage?

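To make the difference concrete, here is a rough userspace model of the
two variants. It is only a sketch and not kernel code: struct mem_cgroup
is reduced to a hypothetical has_victim flag, out_of_memory_stub() is an
invented stand-in for out_of_memory(), and schedule_timeout_killable() is
stubbed to count how often the charging task would have yielded.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for kernel state; not the real structure. */
struct mem_cgroup {
	bool has_victim;	/* assumption: could the OOM killer pick a task? */
};

/* Counts how many times the charging task reached a scheduling point. */
static int sched_points;

/* Stub for the kernel's schedule_timeout_killable(); it only counts calls. */
static long schedule_timeout_killable(long timeout)
{
	(void)timeout;
	sched_points++;
	return 0;
}

/* Invented stand-in for out_of_memory(): true only if a victim exists. */
static bool out_of_memory_stub(const struct mem_cgroup *memcg)
{
	return memcg->has_victim;
}

/* Variant from the quoted patch: yield only when the OOM kill succeeded. */
static bool memcg_oom_conditional(const struct mem_cgroup *memcg)
{
	bool ret;

	assert(memcg != NULL);
	ret = out_of_memory_stub(memcg);
	if (ret)
		schedule_timeout_killable(1);
	return ret;
}

/* Variant asked about above: always yield before returning. */
static bool memcg_oom_unconditional(const struct mem_cgroup *memcg)
{
	bool ret;

	assert(memcg != NULL);
	ret = out_of_memory_stub(memcg);
	schedule_timeout_killable(1);
	return ret;
}
```

With has_victim set to false, the conditional variant returns without
ever reaching the stubbed scheduling point, while the unconditional one
always reaches it regardless of what the caller does afterwards.
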
>  	return ret;
>  }
>
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -3861,6 +3861,12 @@ __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order,
>  	}
>  out:
>  	mutex_unlock(&oom_lock);
> +	/*
> +	 * Give a killed process a good chance to exit before trying to
> +	 * allocate memory again.
> +	 */
> +	if (*did_some_progress)
> +		schedule_timeout_killable(1);

This doesn't make much sense either. Please remember that the primary
reason you are adding this schedule_timeout_killable in this path is
that you want to somehow reduce the priority inversion problem
mentioned by Tetsuo. The page allocator path doesn't lack regular
scheduling points: compaction, reclaim, should_reclaim_retry etc. all
have them.

>  	return page;
>  }
>
-- 
Michal Hocko
SUSE Labs