+ mm-dont-raise-memcg_oom-event-due-to-failed-high-order-allocation.patch added to -mm tree

akpm@xxxxxxxxxxxxxxxxxxxx · Fri, 05 Oct 2018 12:55:49 -0700

The patch titled
     Subject: mm: don't raise MEMCG_OOM event due to failed high-order allocation
has been added to the -mm tree.  Its filename is
     mm-dont-raise-memcg_oom-event-due-to-failed-high-order-allocation.patch

This patch should soon appear at
    http://ozlabs.org/~akpm/mmots/broken-out/mm-dont-raise-memcg_oom-event-due-to-failed-high-order-allocation.patch
and later at
    http://ozlabs.org/~akpm/mmotm/broken-out/mm-dont-raise-memcg_oom-event-due-to-failed-high-order-allocation.patch

Before you just go and hit "reply", please:
   a) Consider who else should be cc'ed
   b) Prefer to cc a suitable mailing list as well
   c) Ideally: find the original patch on the mailing list and do a
      reply-to-all to that, adding suitable additional cc's

*** Remember to use Documentation/process/submit-checklist.rst when testing your code ***

The -mm tree is included into linux-next and is updated
there every 3-4 working days

------------------------------------------------------
From: Roman Gushchin <guro@xxxxxx>
Subject: mm: don't raise MEMCG_OOM event due to failed high-order allocation

It was reported that on some of our machines containers were restarted
with OOM symptoms without an obvious reason.  Despite there were almost no
memory pressure and plenty of page cache, MEMCG_OOM event was raised
occasionally, causing the container management software to think, that OOM
has happened.  However, no tasks have been killed.

The following investigation showed that the problem is caused by a failing
attempt to charge a high-order page.  In such case, the OOM killer is
never invoked.  As shown below, it can happen under conditions, which are
very far from a real OOM: e.g.  there is plenty of clean page cache and no
memory pressure.

There is no sense in raising an OOM event in this case, as it might
confuse a user and lead to wrong and excessive actions (e.g.  restart the
workload, as in my case).

Let's look at the charging path in try_charge().  If the memory usage is
about memory.max, which is absolutely natural for most memory cgroups, we
try to reclaim some pages.  Even if we were able to reclaim enough memory
for the allocation, the following check can fail due to a race with
another concurrent allocation:

    if (mem_cgroup_margin(mem_over_limit) >= nr_pages)
        goto retry;

For regular pages the following condition will save us from triggering
the OOM:

   if (nr_reclaimed && nr_pages <= (1 << PAGE_ALLOC_COSTLY_ORDER))
       goto retry;

But for high-order allocation this condition will intentionally fail.  The
reason behind is that we'll likely fall to regular pages anyway, so it's
ok and even preferred to return ENOMEM.

In this case the idea of raising MEMCG_OOM looks dubious.

Fix this by moving MEMCG_OOM raising to mem_cgroup_oom() after allocation
order check, so that the event won't be raised for high order allocations.
This change doesn't affect regular pages allocation and charging.

Link: http://lkml.kernel.org/r/20181004214050.7417-1-guro@xxxxxx
Signed-off-by: Roman Gushchin <guro@xxxxxx>
Acked-by: David Rientjes <rientjes@xxxxxxxxxx>
Acked-by: Michal Hocko <mhocko@xxxxxxxxxx>
Cc: Johannes Weiner <hannes@xxxxxxxxxxx>
Cc: Vladimir Davydov <vdavydov.dev@xxxxxxxxx>
Signed-off-by: Andrew Morton <akpm@xxxxxxxxxxxxxxxxxxxx>
---

 Documentation/admin-guide/cgroup-v2.rst |    4 ++++
 mm/memcontrol.c                         |    4 ++--
 2 files changed, 6 insertions(+), 2 deletions(-)

--- a/Documentation/admin-guide/cgroup-v2.rst~mm-dont-raise-memcg_oom-event-due-to-failed-high-order-allocation
+++ a/Documentation/admin-guide/cgroup-v2.rst
@@ -1133,6 +1133,10 @@ PAGE_SIZE multiple when read back.
 		disk readahead.  For now OOM in memory cgroup kills
 		tasks iff shortage has happened inside page fault.
 
+		This event is not raised if the OOM killer is not
+		considered as an option, e.g. for failed high-order
+		allocations.
+
 	  oom_kill
 		The number of processes belonging to this cgroup
 		killed by any kind of OOM killer.
--- a/mm/memcontrol.c~mm-dont-raise-memcg_oom-event-due-to-failed-high-order-allocation
+++ a/mm/memcontrol.c
@@ -1669,6 +1669,8 @@ static enum oom_status mem_cgroup_oom(st
 	if (order > PAGE_ALLOC_COSTLY_ORDER)
 		return OOM_SKIPPED;
 
+	memcg_memory_event(memcg, MEMCG_OOM);
+
 	/*
 	 * We are in the middle of the charge context here, so we
 	 * don't want to block when potentially sitting on a callstack
@@ -2250,8 +2252,6 @@ retry:
 	if (fatal_signal_pending(current))
 		goto force;
 
-	memcg_memory_event(mem_over_limit, MEMCG_OOM);
-
 	/*
 	 * keep retrying as long as the memcg oom killer is able to make
 	 * a forward progress or bypass the charge if the oom killer
_

Patches currently in -mm which might be from guro@xxxxxx are

mm-rework-memcg-kernel-stack-accounting.patch
mm-drain-memcg-stocks-on-css-offlining.patch
mm-dont-miss-the-last-page-because-of-round-off-error.patch
mm-dont-miss-the-last-page-because-of-round-off-error-fix.patch
mm-dont-raise-memcg_oom-event-due-to-failed-high-order-allocation.patch