On Wed, Jul 15, 2020 at 10:44 AM David Rientjes <rientjes@xxxxxxxxxx> wrote:
>
> On Wed, 15 Jul 2020, Yafang Shao wrote:
>
> > > > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > > > index 1962232..15e0e18 100644
> > > > --- a/mm/memcontrol.c
> > > > +++ b/mm/memcontrol.c
> > > > @@ -1560,15 +1560,21 @@ static bool mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask,
> > > >  		.gfp_mask = gfp_mask,
> > > >  		.order = order,
> > > >  	};
> > > > -	bool ret;
> > > > +	bool ret = true;
> > > >
> > > >  	if (mutex_lock_killable(&oom_lock))
> > > >  		return true;
> > > > +
> > > > +	if (mem_cgroup_margin(memcg) >= (1 << order))
> > > > +		goto unlock;
> > > > +
> > > >  	/*
> > > >  	 * A few threads which were not waiting at mutex_lock_killable() can
> > > >  	 * fail to bail out. Therefore, check again after holding oom_lock.
> > > >  	 */
> > > >  	ret = should_force_charge() || out_of_memory(&oc);
> > > > +
> > > > +unlock:
> > > >  	mutex_unlock(&oom_lock);
> > > >  	return ret;
> > > > }
> > >
> > > Hi Yafang,
> > >
> > > We've run with a patch very much like this for several years and it works
> > > quite successfully to prevent the unnecessary oom killing of processes.
> > >
> > > We do this in out_of_memory() directly, however, because we found that we
> > > could prevent even *more* unnecessary killing if we checked this at the
> > > "point of no return", because the selection of processes takes some
> > > additional time during which we might resolve the oom condition.
> > >
> >
> > Hi David,
> > Your proposal could also resolve the issue,
>
> It has successfully resolved it for several years in our kernel.  We tried
> an approach similar to yours but saw many instances where memcg oom kills
> continued to proceed even though the memcg information dumped to the
> kernel log showed memory available.
>
> If it was a page or two that became available due to memory freeing, that
> is not a significant difference.  But if this races with an oom
> notification and a process exiting or being SIGKILL'd, it becomes much
> harder to explain to a user why their process was oom killed when there
> are tens of megabytes of memory available as shown by the kernel log (the
> freeing/exiting happened during a particularly long iteration over the
> processes attached to the memcg, for example).
>
> That's what motivated the change to do this in out_of_memory() directly:
> we found that it prevented even more unnecessary oom kills, which is a
> very good thing.  It may only be easily observable and make a significant
> difference at very large scale, however.
>

Thanks for the clarification.

If it is this race that causes the issue and we want to narrow the race
window, I am not sure whether it is better to check the memcg margin in
out_of_memory() or right before calling do_send_sig_info().  Per my
understanding, dump_header() always takes much more time than
select_bad_process(), especially if there are slow consoles, so the race
can easily happen while dump_header() or other information is being
printed.  On the other hand, if we check the memcg margin only after
dumping all of the oom info, it would look strange to emit a full oom
report and then not kill anything.
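For the record, the late check I am describing would look something like
this (an untested sketch only; the exact hook point in oom_kill_process()
is just an illustration, and it assumes mem_cgroup_margin() has been made
visible to mm/oom_kill.c as in your patch quoted below):

	static void oom_kill_process(struct oom_control *oc, const char *message)
	{
		struct task_struct *victim = oc->chosen;
		...
		if (__ratelimit(&oom_rs))
			dump_header(oc, victim);

		/*
		 * Last-moment re-check: dump_header() may have spent a long
		 * time on slow consoles, so the margin may have been restored
		 * while the oom report was being printed.
		 */
		if (is_memcg_oom(oc) &&
		    mem_cgroup_margin(oc->memcg) >= (1 << oc->order)) {
			put_task_struct(victim);
			return;
		}

		__oom_kill_process(victim, message);
	}

That placement closes most of the window, but at the cost of the
odd-looking log described above.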
> > but I'm wondering why we do
> > it specifically for memcg oom.
> > Doesn't it also apply to global oom?
> > For example, in the global oom, while the processes are being selected,
> > others might free some pages, and then the allocation might succeed.
>
> It's more complex because memory being allocated from the page allocator
> must be physically contiguous; it's not a simple matter of comparing the
> margin of available memory to the memory being charged.  It could
> theoretically be done, but I haven't seen a use case where it has actually
> mattered, as opposed to memcg oom where it can happen quite readily at
> scale.  When memory is uncharged to a memcg because of large freeing or a
> process exiting, that memory is immediately chargeable by another process
> in the same hierarchy because of its isolation, whereas in the page
> allocator that memory is up for grabs and anything on the system could
> allocate it.
>
> > > Some may argue that this is unnecessarily exposing mem_cgroup_margin() to
> > > generic mm code, but in the interest of preventing any unnecessary oom
> > > kill we've found it to be helpful.
> > >
> > > I proposed a variant of this in https://lkml.org/lkml/2020/3/11/1089.
> > >
> > > diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> > > --- a/include/linux/memcontrol.h
> > > +++ b/include/linux/memcontrol.h
> > > @@ -798,6 +798,8 @@ static inline void memcg_memory_event_mm(struct mm_struct *mm,
> > >  void mem_cgroup_split_huge_fixup(struct page *head);
> > >  #endif
> > >
> > > +unsigned long mem_cgroup_margin(struct mem_cgroup *memcg);
> > > +
> > >  #else /* CONFIG_MEMCG */
> > >
> > >  #define MEM_CGROUP_ID_SHIFT	0
> > > @@ -825,6 +827,11 @@ static inline void memcg_memory_event_mm(struct mm_struct *mm,
> > >  {
> > >  }
> > >
> > > +static inline unsigned long mem_cgroup_margin(struct mem_cgroup *memcg)
> > > +{
> > > +	return 0;
> > > +}
> > > +
> > >  static inline unsigned long mem_cgroup_protection(struct mem_cgroup *memcg,
> > >  						  bool in_low_reclaim)
> > >  {
> > > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > > --- a/mm/memcontrol.c
> > > +++ b/mm/memcontrol.c
> > > @@ -1282,7 +1282,7 @@ void mem_cgroup_update_lru_size(struct lruvec *lruvec, enum lru_list lru,
> > >   * Returns the maximum amount of memory @mem can be charged with, in
> > >   * pages.
> > >   */
> > > -static unsigned long mem_cgroup_margin(struct mem_cgroup *memcg)
> > > +unsigned long mem_cgroup_margin(struct mem_cgroup *memcg)
> > >  {
> > >  	unsigned long margin = 0;
> > >  	unsigned long count;
> > > diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> > > --- a/mm/oom_kill.c
> > > +++ b/mm/oom_kill.c
> > > @@ -1109,9 +1109,23 @@ bool out_of_memory(struct oom_control *oc)
> > >  		if (!is_sysrq_oom(oc) && !is_memcg_oom(oc))
> > >  			panic("System is deadlocked on memory\n");
> > >  	}
> > > -	if (oc->chosen && oc->chosen != (void *)-1UL)
> > > +	if (oc->chosen && oc->chosen != (void *)-1UL) {
> > > +		if (is_memcg_oom(oc)) {
> > > +			/*
> > > +			 * If a memcg is now under its limit or current will be
> > > +			 * exiting and freeing memory, avoid needlessly killing
> > > +			 * chosen.
> > > +			 */
> > > +			if (mem_cgroup_margin(oc->memcg) >= (1 << oc->order) ||
> > > +			    task_will_free_mem(current)) {
> > > +				put_task_struct(oc->chosen);
> > > +				return true;
> > > +			}
> > > +		}
> > > +
> > >  		oom_kill_process(oc, !is_memcg_oom(oc) ? "Out of memory" :
> > >  				 "Memory cgroup out of memory");
> > > +	}
> > >  	return !!oc->chosen;
> > >  }
> > >
> >
> > --
> > Thanks
> > Yafang
>

--
Thanks
Yafang
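For reference, the test both patches rely on,
mem_cgroup_margin(memcg) >= (1 << order), compares the memcg's remaining
headroom in pages against the 2^order pages of the failed charge.
Simplified from mm/memcontrol.c (the swap-accounting half of the real
helper is omitted here), it is roughly:

	/*
	 * The maximum number of pages @memcg can still be charged with
	 * before hitting its limit: limit - usage, or 0 if usage is
	 * already at or above the limit.
	 */
	static unsigned long mem_cgroup_margin(struct mem_cgroup *memcg)
	{
		unsigned long count = page_counter_read(&memcg->memory);
		unsigned long limit = READ_ONCE(memcg->memory.max);

		return count < limit ? limit - count : 0;
	}

So for an order-0 charge the kill is skipped as soon as a single page of
headroom reappears, which is why the race with uncharging is so easy to
hit at scale.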