On Wed, Jul 15, 2020 at 10:44 AM David Rientjes <rientjes@xxxxxxxxxx> wrote:
>
> On Wed, 15 Jul 2020, Yafang Shao wrote:
>
> > > > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > > > index 1962232..15e0e18 100644
> > > > --- a/mm/memcontrol.c
> > > > +++ b/mm/memcontrol.c
> > > > @@ -1560,15 +1560,21 @@ static bool mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask,
> > > >  		.gfp_mask = gfp_mask,
> > > >  		.order = order,
> > > >  	};
> > > > -	bool ret;
> > > > +	bool ret = true;
> > > >
> > > >  	if (mutex_lock_killable(&oom_lock))
> > > >  		return true;
> > > > +
> > > > +	if (mem_cgroup_margin(memcg) >= (1 << order))
> > > > +		goto unlock;
> > > > +
> > > >  	/*
> > > >  	 * A few threads which were not waiting at mutex_lock_killable() can
> > > >  	 * fail to bail out. Therefore, check again after holding oom_lock.
> > > >  	 */
> > > >  	ret = should_force_charge() || out_of_memory(&oc);
> > > > +
> > > > +unlock:
> > > >  	mutex_unlock(&oom_lock);
> > > >  	return ret;
> > > > }
> > >
> > > Hi Yafang,
> > >
> > > We've run with a patch very much like this for several years and it works
> > > quite successfully to prevent the unnecessary oom killing of processes.
> > >
> > > We do this in out_of_memory() directly, however, because we found that we
> > > could prevent even *more* unnecessary killing if we checked this at the
> > > "point of no return", because the selection of processes takes some
> > > additional time during which we might resolve the oom condition.
> > >
> >
> > Hi David,
> > Your proposal could also resolve the issue,
>
> It has successfully resolved it for several years in our kernel.  We tried
> an approach similar to yours but saw many instances where memcg oom kills
> continued to proceed even though the memcg information dumped to the
> kernel log showed memory available.
>
> If it was a page or two that became available due to memory freeing, that
> is not a significant difference.  But if this races with an oom
> notification and a process exiting or being SIGKILL'd, it becomes much
> harder to explain to a user why their process was oom killed when there
> are tens of megabytes of memory available as shown by the kernel log (the
> freeing/exiting happened during a particularly long iteration over the
> processes attached to the memcg, for example).
>
> That's what motivated the change to do this in out_of_memory() directly:
> we found that it prevented even more unnecessary oom kills, which is a
> very good thing.  It may only be easily observable and make a significant
> difference at very large scale, however.
>

Thanks for the clarification.

If it is this race that causes the issue and we want to narrow the race
window, I am not sure whether it is better to check the memcg margin in
out_of_memory() or right before calling do_send_sig_info().  Per my
understanding, dump_header() always takes much more time than
select_bad_process(), especially if there are slow consoles, so the race
can easily happen while dump_header() or other information is being
printed.  On the other hand, if we check the memcg margin only after
dumping all of the oom info, it would look strange to emit a full oom
report and then not kill anything.
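For the record, the late check I am describing would look something like
this (an untested sketch only; the exact hook point in oom_kill_process()
is just an illustration, and it assumes mem_cgroup_margin() has been made
visible to mm/oom_kill.c as in your patch quoted below):

	static void oom_kill_process(struct oom_control *oc, const char *message)
	{
		struct task_struct *victim = oc->chosen;
		...
		if (__ratelimit(&oom_rs))
			dump_header(oc, victim);

		/*
		 * Last-moment re-check: dump_header() may have spent a long
		 * time on slow consoles, so the margin may have been restored
		 * while the oom report was being printed.
		 */
		if (is_memcg_oom(oc) &&
		    mem_cgroup_margin(oc->memcg) >= (1 << oc->order)) {
			put_task_struct(victim);
			return;
		}

		__oom_kill_process(victim, message);
	}

That placement closes most of the window, but at the cost of the
odd-looking log described above.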
> > but I'm wondering why we do
> > it specifically for memcg oom.
> > Doesn't it also apply to global oom?
> > For example, in the global oom, while the processes are being selected,
> > others might free some pages, and then the allocation might succeed.
>
> It's more complex because memory being allocated from the page allocator
> must be physically contiguous; it's not a simple matter of comparing the
> margin of available memory to the memory being charged.  It could
> theoretically be done, but I haven't seen a use case where it has actually
> mattered, as opposed to memcg oom where it can happen quite readily at
> scale.  When memory is uncharged to a memcg because of large freeing or a
> process exiting, that memory is immediately chargeable by another process
> in the same hierarchy because of its isolation, whereas in the page
> allocator that memory is up for grabs and anything on the system could
> allocate it.
>
> > > Some may argue that this is unnecessarily exposing mem_cgroup_margin() to
> > > generic mm code, but in the interest of preventing any unnecessary oom
> > > kill we've found it to be helpful.
> > >
> > > I proposed a variant of this in https://lkml.org/lkml/2020/3/11/1089.
> > >
> > > diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> > > --- a/include/linux/memcontrol.h
> > > +++ b/include/linux/memcontrol.h
> > > @@ -798,6 +798,8 @@ static inline void memcg_memory_event_mm(struct mm_struct *mm,
> > >  void mem_cgroup_split_huge_fixup(struct page *head);
> > >  #endif
> > >
> > > +unsigned long mem_cgroup_margin(struct mem_cgroup *memcg);
> > > +
> > >  #else /* CONFIG_MEMCG */
> > >
> > >  #define MEM_CGROUP_ID_SHIFT	0
> > > @@ -825,6 +827,11 @@ static inline void memcg_memory_event_mm(struct mm_struct *mm,
> > >  {
> > >  }
> > >
> > > +static inline unsigned long mem_cgroup_margin(struct mem_cgroup *memcg)
> > > +{
> > > +	return 0;
> > > +}
> > > +
> > >  static inline unsigned long mem_cgroup_protection(struct mem_cgroup *memcg,
> > >  						  bool in_low_reclaim)
> > >  {
> > > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > > --- a/mm/memcontrol.c
> > > +++ b/mm/memcontrol.c
> > > @@ -1282,7 +1282,7 @@ void mem_cgroup_update_lru_size(struct lruvec *lruvec, enum lru_list lru,
> > >   * Returns the maximum amount of memory @mem can be charged with, in
> > >   * pages.
> > >   */
> > > -static unsigned long mem_cgroup_margin(struct mem_cgroup *memcg)
> > > +unsigned long mem_cgroup_margin(struct mem_cgroup *memcg)
> > >  {
> > >  	unsigned long margin = 0;
> > >  	unsigned long count;
> > > diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> > > --- a/mm/oom_kill.c
> > > +++ b/mm/oom_kill.c
> > > @@ -1109,9 +1109,23 @@ bool out_of_memory(struct oom_control *oc)
> > >  		if (!is_sysrq_oom(oc) && !is_memcg_oom(oc))
> > >  			panic("System is deadlocked on memory\n");
> > >  	}
> > > -	if (oc->chosen && oc->chosen != (void *)-1UL)
> > > +	if (oc->chosen && oc->chosen != (void *)-1UL) {
> > > +		if (is_memcg_oom(oc)) {
> > > +			/*
> > > +			 * If a memcg is now under its limit or current will be
> > > +			 * exiting and freeing memory, avoid needlessly killing
> > > +			 * chosen.
> > > +			 */
> > > +			if (mem_cgroup_margin(oc->memcg) >= (1 << oc->order) ||
> > > +			    task_will_free_mem(current)) {
> > > +				put_task_struct(oc->chosen);
> > > +				return true;
> > > +			}
> > > +		}
> > > +
> > >  		oom_kill_process(oc, !is_memcg_oom(oc) ? "Out of memory" :
> > >  				 "Memory cgroup out of memory");
> > > +	}
> > >  	return !!oc->chosen;
> > >  }
> > >
> >
> > --
> > Thanks
> > Yafang
>

--
Thanks
Yafang
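For reference, the test both patches rely on,
mem_cgroup_margin(memcg) >= (1 << order), compares the memcg's remaining
headroom in pages against the 2^order pages of the failed charge.
Simplified from mm/memcontrol.c (the swap-accounting half of the real
helper is omitted here), it is roughly:

	/*
	 * The maximum number of pages @memcg can still be charged with
	 * before hitting its limit: limit - usage, or 0 if usage is
	 * already at or above the limit.
	 */
	static unsigned long mem_cgroup_margin(struct mem_cgroup *memcg)
	{
		unsigned long count = page_counter_read(&memcg->memory);
		unsigned long limit = READ_ONCE(memcg->memory.max);

		return count < limit ? limit - count : 0;
	}

So for an order-0 charge the kill is skipped as soon as a single page of
headroom reappears, which is why the race with uncharging is so easy to
hit at scale.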