Michal Hocko wrote:
> On Fri 22-12-17 00:34:05, Tetsuo Handa wrote:
> > Michal Hocko wrote:
> [...]
> > > Let me repeat something I've said a long time ago. We do not
> > > optimize for corner cases. We want to survive, but if the
> > > alternative is to kill another task then we can live with that.
> >
> > Setting MMF_OOM_SKIP before all OOM-killed threads try memory reserves
> > leads to needlessly selecting more OOM victims.
> >
> > Unless some OOM-killed thread fails to satisfy its allocation even
> > with ALLOC_OOM, no OOM-killed thread needs to select more OOM victims.
> > Commit 696453e66630ad45 ("mm, oom: task_will_free_mem should skip
> > oom_reaped tasks") obviously broke that, which is exactly a regression.
>
> You are trying to fix a completely artificial case. Or do you have any
> example of an application which uses CLONE_VM without sharing signals?

Your response is invalid and insane resistance. You dared to silently
make user-visible changes. If you really believe that there is no
application which uses CLONE_VM without sharing signals, let's revert
the patch below.

----------
commit 44a70adec910d6929689e42b6e5cee5b7d202d20
Author: Michal Hocko <mhocko@xxxxxxxx>
Date:   Thu Jul 28 15:44:43 2016 -0700

    mm, oom_adj: make sure processes sharing mm have same view of oom_score_adj

    oom_score_adj is shared for the thread groups (via struct signal) but
    this is not sufficient to cover processes sharing mm (CLONE_VM without
    CLONE_SIGHAND) and so we can easily end up in a situation when some
    processes update their oom_score_adj and confuse the oom killer.  In
    the worst case some of those processes might hide from the oom killer
    altogether via OOM_SCORE_ADJ_MIN while others are eligible.  The OOM
    killer would then pick up those eligible ones, but it would not be
    allowed to kill the others sharing the same mm, so the mm - and thus
    the memory - would not be released.

    It would be ideal to have the oom_score_adj per mm_struct because that
    is the natural entity the OOM killer considers.  But this will not
    work because some programs are doing

        vfork()
        set_oom_adj()
        exec()

    We can achieve the same though.  The oom_score_adj write handler can
    set the oom_score_adj for all processes sharing the same mm if the
    task is not in the middle of vfork.  As a result all the processes
    will share the same oom_score_adj.  The current implementation is
    rather pessimistic and checks all the existing processes by default
    if there is more than 1 holder of the mm, but we do not have any
    reliable way to check for external users yet.

    Link: http://lkml.kernel.org/r/1466426628-15074-5-git-send-email-mhocko@xxxxxxxxxx
    Signed-off-by: Michal Hocko <mhocko@xxxxxxxx>
    Acked-by: Oleg Nesterov <oleg@xxxxxxxxxx>
    Cc: Vladimir Davydov <vdavydov@xxxxxxxxxxxxx>
    Cc: David Rientjes <rientjes@xxxxxxxxxx>
    Cc: Tetsuo Handa <penguin-kernel@xxxxxxxxxxxxxxxxxxx>
    Signed-off-by: Andrew Morton <akpm@xxxxxxxxxxxxxxxxxxxx>
    Signed-off-by: Linus Torvalds <torvalds@xxxxxxxxxxxxxxxxxxxx>
----------
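For the record, such an application is trivial to write. Below is a
minimal sketch of my own (illustrative only; the value 500 and the demo
flow are my invention, and error handling is trimmed). The child shares
the parent's mm via CLONE_VM but, without CLONE_SIGHAND, has its own
signal struct and thus its own signal->oom_score_adj:

/* Build with: gcc -o clone_vm_demo clone_vm_demo.c */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/wait.h>

static int child(void *arg)
{
	/* Separate signal struct => separate oom_score_adj.  Raising
	 * one's own oom_score_adj needs no privilege. */
	int fd = open("/proc/self/oom_score_adj", O_WRONLY);
	if (fd >= 0) {
		write(fd, "500", 3);
		close(fd);
	}
	return 0;
}

int main(void)
{
	const size_t stack_size = 1024 * 1024;
	char *stack = malloc(stack_size);
	char buf[16] = "";
	pid_t pid;
	int fd;

	if (!stack)
		return 1;
	/* CLONE_VM without CLONE_SIGHAND/CLONE_THREAD: one mm shared by
	 * two distinct thread groups. */
	pid = clone(child, stack + stack_size, CLONE_VM | SIGCHLD, NULL);
	if (pid < 0)
		return 1;
	waitpid(pid, NULL, 0);
	/* With commit 44a70adec910 applied, the child's write propagates
	 * to every process sharing the mm; on older kernels the two
	 * values silently diverge. */
	fd = open("/proc/self/oom_score_adj", O_RDONLY);
	if (fd >= 0 && read(fd, buf, sizeof(buf) - 1) > 0)
		printf("parent's oom_score_adj: %s", buf);
	return 0;
}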
We do prepare for the worst case. We have just experienced two mistakes
caused by not considering the worst case.

----------
commit 400e22499dd92613821374c8c6c88c7225359980
Author: Tetsuo Handa <penguin-kernel@xxxxxxxxxxxxxxxxxxx>
Date:   Wed Nov 15 17:38:37 2017 -0800

    mm: don't warn about allocations which stall for too long

    Commit 63f53dea0c98 ("mm: warn about allocations which stall for too
    long") was a great step for reducing the possibility of silent hang
    up problems caused by memory allocation stalls.  But this commit
    reverts it, for it is possible to trigger OOM lockup and/or soft
    lockups when many threads concurrently call warn_alloc() (in order
    to warn about memory allocation stalls) due to the current
    implementation of printk(), and it is difficult to obtain useful
    information due to the limitations of the synchronous warning
    approach.

    The current printk() implementation flushes all pending logs using
    the context of a thread which called console_unlock().  printk()
    should be able to flush all pending logs eventually unless somebody
    continues appending to the printk() buffer.

    Since warn_alloc() started appending to the printk() buffer while
    waiting for oom_kill_process() to make forward progress when
    oom_kill_process() is processing pending logs, it became possible
    for warn_alloc() to force oom_kill_process() to loop inside
    printk().  As a result, warn_alloc() significantly increased the
    possibility of preventing oom_kill_process() from making forward
    progress.

    ---------- Pseudo code start ----------
    Before warn_alloc() was introduced:

      retry:
        if (mutex_trylock(&oom_lock)) {
          while (atomic_read(&printk_pending_logs) > 0) {
            atomic_dec(&printk_pending_logs);
            print_one_log();
          }
          // Send SIGKILL here.
          mutex_unlock(&oom_lock);
        }
        goto retry;

    After warn_alloc() was introduced:

      retry:
        if (mutex_trylock(&oom_lock)) {
          while (atomic_read(&printk_pending_logs) > 0) {
            atomic_dec(&printk_pending_logs);
            print_one_log();
          }
          // Send SIGKILL here.
          mutex_unlock(&oom_lock);
        } else if (waited_for_10seconds()) {
          atomic_inc(&printk_pending_logs);
        }
        goto retry;
    ---------- Pseudo code end ----------

    Although waited_for_10seconds() becomes true only once per 10
    seconds, an unbounded number of threads can call
    waited_for_10seconds() at the same time.  Also, since the threads
    doing waited_for_10seconds() are almost busy-looping, the thread
    doing print_one_log() gets little CPU time.  Therefore, this
    situation can be simplified like

    ---------- Pseudo code start ----------
      retry:
        if (mutex_trylock(&oom_lock)) {
          while (atomic_read(&printk_pending_logs) > 0) {
            atomic_dec(&printk_pending_logs);
            print_one_log();
          }
          // Send SIGKILL here.
          mutex_unlock(&oom_lock);
        } else {
          atomic_inc(&printk_pending_logs);
        }
        goto retry;
    ---------- Pseudo code end ----------

    when printk() is called faster than print_one_log() can process a
    log.

    One possible mitigation would be to introduce a new lock in order
    to make sure that no other series of printk() (either
    oom_kill_process() or warn_alloc()) can append to the printk()
    buffer when one series of printk() (either oom_kill_process() or
    warn_alloc()) is already in progress.  Such serialization would
    also help with obtaining kernel messages in readable form.

    ---------- Pseudo code start ----------
      retry:
        if (mutex_trylock(&oom_lock)) {
          mutex_lock(&oom_printk_lock);
          while (atomic_read(&printk_pending_logs) > 0) {
            atomic_dec(&printk_pending_logs);
            print_one_log();
          }
          // Send SIGKILL here.
          mutex_unlock(&oom_printk_lock);
          mutex_unlock(&oom_lock);
        } else {
          if (mutex_trylock(&oom_printk_lock)) {
            atomic_inc(&printk_pending_logs);
            mutex_unlock(&oom_printk_lock);
          }
        }
        goto retry;
    ---------- Pseudo code end ----------

    But this commit does not go in that direction, for we don't want to
    introduce a new lock dependency, and we are unlikely to be able to
    obtain useful information even if we serialized oom_kill_process()
    and warn_alloc().

    The synchronous approach is prone to unexpected results (e.g. too
    late [1], too frequent [2], overlooked [3]).  As far as I know,
    warn_alloc() never helped with providing information other than
    "something is going wrong".  I want to consider an asynchronous
    approach which can obtain information from possibly relevant
    threads (e.g. the owner of oom_lock and kswapd-like threads) during
    stalls and can serve as a trigger for actions (e.g. turn on/off
    tracepoints, ask the libvirt daemon to take a memory dump of a
    stalling KVM guest for diagnostic purposes).

    This commit temporarily loses the ability to report, for example,
    an OOM lockup caused by being unable to invoke the OOM killer for a
    !__GFP_FS allocation request.  But the asynchronous approach will
    be able to detect such situations and emit warnings.  Thus, let's
    remove warn_alloc().

    [1] https://bugzilla.kernel.org/show_bug.cgi?id=192981
    [2] http://lkml.kernel.org/r/CAM_iQpWuPVGc2ky8M-9yukECtS+zKjiDasNymX7rMcBjBFyM_A@xxxxxxxxxxxxxx
    [3] commit db73ee0d46379922 ("mm, vmscan: do not loop on
        too_many_isolated for ever")

    Link: http://lkml.kernel.org/r/1509017339-4802-1-git-send-email-penguin-kernel@xxxxxxxxxxxxxxxxxxx
    Signed-off-by: Tetsuo Handa <penguin-kernel@xxxxxxxxxxxxxxxxxxx>
    Reported-by: Cong Wang <xiyou.wangcong@xxxxxxxxx>
    Reported-by: yuwang.yuwang <yuwang.yuwang@xxxxxxxxxxxxxxx>
    Reported-by: Johannes Weiner <hannes@xxxxxxxxxxx>
    Acked-by: Michal Hocko <mhocko@xxxxxxxx>
    Acked-by: Johannes Weiner <hannes@xxxxxxxxxxx>
    Cc: Vlastimil Babka <vbabka@xxxxxxx>
    Cc: Mel Gorman <mgorman@xxxxxxx>
    Cc: Dave Hansen <dave.hansen@xxxxxxxxx>
    Cc: Sergey Senozhatsky <sergey.senozhatsky@xxxxxxxxx>
    Cc: Petr Mladek <pmladek@xxxxxxxx>
    Cc: Steven Rostedt <rostedt@xxxxxxxxxxx>
    Signed-off-by: Andrew Morton <akpm@xxxxxxxxxxxxxxxxxxxx>
    Signed-off-by: Linus Torvalds <torvalds@xxxxxxxxxxxxxxxxxxxx>
----------
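The starvation pattern in the pseudo code above can be reproduced with
a plain userspace analogue. This is a sketch of my own for illustration
(the thread count, the 1000-entry initial dump, and the 1 ms per-log
delay standing in for a slow console are arbitrary choices of mine, not
kernel code):

/* Build with: gcc -pthread -o printk_starve printk_starve.c */
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>
#include <unistd.h>

#define NR_THREADS 16

static pthread_mutex_t oom_lock = PTHREAD_MUTEX_INITIALIZER;
static atomic_long pending_logs;
static atomic_bool victim_killed;

static void print_one_log(void)
{
	usleep(1000);	/* stand-in for a slow console */
}

static void *allocator(void *arg)
{
	while (!atomic_load(&victim_killed)) {
		if (pthread_mutex_trylock(&oom_lock) == 0) {
			/* oom_kill_process() path: drain pending logs
			 * before sending SIGKILL. */
			while (atomic_load(&pending_logs) > 0) {
				atomic_fetch_sub(&pending_logs, 1);
				print_one_log();
			}
			atomic_store(&victim_killed, true); /* SIGKILL */
			pthread_mutex_unlock(&oom_lock);
		} else {
			/* warn_alloc() path: append one more log and
			 * retry, starving the drain loop above. */
			atomic_fetch_add(&pending_logs, 1);
		}
	}
	return NULL;
}

int main(void)
{
	pthread_t tids[NR_THREADS];
	int i;

	/* Simulate the OOM killer's own task dump queued first. */
	atomic_store(&pending_logs, 1000);
	for (i = 0; i < NR_THREADS; i++)
		pthread_create(&tids[i], NULL, allocator, NULL);
	for (i = 0; i < 10; i++) {
		sleep(1);
		printf("pending logs: %ld, victim killed: %d\n",
		       atomic_load(&pending_logs),
		       (int)atomic_load(&victim_killed));
	}
	return 0;
}

On my reading, pending_logs keeps growing and the victim is never
killed, which is exactly the livelock the commit describes.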
----------
commit 4837fe37adff1d159904f0c013471b1ecbcb455e
Author: Michal Hocko <mhocko@xxxxxxxx>
Date:   Thu Dec 14 15:33:15 2017 -0800

    mm, oom_reaper: fix memory corruption

    David Rientjes has reported the following memory corruption while
    the oom reaper tries to unmap the victim's address space:

      BUG: Bad page map in process oom_reaper  pte:6353826300000000 pmd:00000000
      addr:00007f50cab1d000 vm_flags:08100073 anon_vma:ffff9eea335603f0
      mapping:          (null) index:7f50cab1d
      file:          (null) fault:          (null)
      mmap:          (null) readpage:          (null)
      CPU: 2 PID: 1001 Comm: oom_reaper
      Call Trace:
       unmap_page_range+0x1068/0x1130
       __oom_reap_task_mm+0xd5/0x16b
       oom_reaper+0xff/0x14c
       kthread+0xc1/0xe0

    Tetsuo Handa has noticed that the synchronization inside exit_mmap
    is insufficient.  We only synchronize with the oom reaper if
    tsk_is_oom_victim, which is not true if the final __mmput is called
    from a context other than the oom victim's exit path.  This can
    trivially happen from the context of any task which has grabbed an
    mm reference (e.g. to read a /proc/<pid>/ file which requires the
    mm, etc.).  The race would look like this:

      oom_reaper           oom_victim           task
                                                mmget_not_zero
                           do_exit
                             mmput
      __oom_reap_task_mm                        mmput
                                                  __mmput
                                                    exit_mmap
                                                      remove_vma
        unmap_page_range

    Fix this issue by providing a new mm_is_oom_victim() helper which
    operates on the mm struct rather than a task.  Any context which
    operates on a remote mm struct should use this helper in place of
    tsk_is_oom_victim.  The flag is set in mark_oom_victim and never
    cleared, so it is stable in the exit_mmap path.

    Debugged by Tetsuo Handa.

    Link: http://lkml.kernel.org/r/20171210095130.17110-1-mhocko@xxxxxxxxxx
    Fixes: 212925802454 ("mm: oom: let oom_reap_task and exit_mmap run concurrently")
    Signed-off-by: Michal Hocko <mhocko@xxxxxxxx>
    Reported-by: David Rientjes <rientjes@xxxxxxxxxx>
    Acked-by: David Rientjes <rientjes@xxxxxxxxxx>
    Cc: Tetsuo Handa <penguin-kernel@xxxxxxxxxxxxxxxxxxx>
    Cc: Andrea Argangeli <andrea@xxxxxxxxxx>
    Cc: <stable@xxxxxxxxxxxxxxx>	[4.14]
    Signed-off-by: Andrew Morton <akpm@xxxxxxxxxxxxxxxxxxxx>
    Signed-off-by: Linus Torvalds <torvalds@xxxxxxxxxxxxxxxxxxxx>
----------
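For readers who do not have the patch in front of them, the fix boils
down to a flag on the mm itself. Roughly, abridged from my reading of
the commit (not a verbatim quote; bit numbers and exact hunk placement
omitted):

/* include/linux/oom.h: the new helper tests an mm flag instead of a
 * task flag, so any context holding a remote mm reference can use it. */
static inline bool mm_is_oom_victim(struct mm_struct *mm)
{
	return test_bit(MMF_OOM_VICTIM, &mm->flags);
}

/* mm/oom_kill.c, mark_oom_victim(): the flag is set exactly once,
 * when the victim's mm is pinned, and is never cleared. */
	if (!cmpxchg(&tsk->signal->oom_mm, NULL, mm)) {
		mmgrab(tsk->signal->oom_mm);
		set_bit(MMF_OOM_VICTIM, &mm->flags);
	}

/* mm/mmap.c, exit_mmap(): synchronize with the oom reaper based on
 * the mm, not on whoever happens to run the final __mmput(). */
	if (unlikely(mm_is_oom_victim(mm))) {
		set_bit(MMF_OOM_SKIP, &mm->flags);
		down_write(&mm->mmap_sem);
		up_write(&mm->mmap_sem);
	}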
There is no reason not to make sure that processes sharing the victim's
mm get the same treatment regarding use of memory reserves. You want to
get rid of the TIF_MEMDIE flag, but you rejected my patch, which is one
of the steps toward getting rid of TIF_MEMDIE.

Whether a problem is seen in real life is a catch-22 discussion, for
you keep refusing to provide a method which would allow normal users to
tell whether they have hit that problem.

Please stop rejecting my patches without thinking deeply.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@xxxxxxxxx. For more info on Linux MM,
see: http://www.linux-mm.org/ .