+ mm-page_alloc-fix-check-for-null-preferred_zone.patch added to -mm tree

akpm@xxxxxxxxxxxxxxxxxxxx · Fri, 20 Jan 2017 15:17:11 -0800

The patch titled
     Subject: mm, page_alloc: fix check for NULL preferred_zone
has been added to the -mm tree.  Its filename is
     mm-page_alloc-fix-check-for-null-preferred_zone.patch

This patch should soon appear at
    http://ozlabs.org/~akpm/mmots/broken-out/mm-page_alloc-fix-check-for-null-preferred_zone.patch
and later at
    http://ozlabs.org/~akpm/mmotm/broken-out/mm-page_alloc-fix-check-for-null-preferred_zone.patch

Before you just go and hit "reply", please:
   a) Consider who else should be cc'ed
   b) Prefer to cc a suitable mailing list as well
   c) Ideally: find the original patch on the mailing list and do a
      reply-to-all to that, adding suitable additional cc's

*** Remember to use Documentation/SubmitChecklist when testing your code ***

The -mm tree is included into linux-next and is updated
there every 3-4 working days

------------------------------------------------------
From: Vlastimil Babka <vbabka@xxxxxxx>
Subject: mm, page_alloc: fix check for NULL preferred_zone

Patch series "fix premature OOM regression in 4.7+ due to cpuset races", v2.

This is v2 of my attempt to fix the recent report based on LTP cpuset
stress test [1].  The intention is to go to stable 4.9 LTS with this, as
triggering repeated OOMs is not nice.  That's why the patches try to be
not too intrusive.

Unfortunately, while investigating I found that modifying the testcase to
use per-VMA policies instead of per-task policies will bring the OOMs back,
but that seems to be a much older and harder-to-fix problem.  I have posted
an RFC [2] but I believe that fixing the recent regressions has a higher
priority.

Longer-term we might try to think how to fix the cpuset mess in a better
and less error-prone way.  I was, for example, very surprised to learn that
cpuset updates change not only task->mems_allowed, but also the nodemask of
mempolicies.  Until now I expected the parameter to alloc_pages_nodemask()
to be stable.  I wonder why we then treat cpusets specially in
get_page_from_freelist() and distinguish HARDWALL etc, when there's an
unconditional intersection between mempolicy and cpuset.  I would expect
the nodemask adjustment to save overhead in g_p_f(), but that clearly
doesn't happen in the current form.  So we have both crazy complexity and
overhead, AFAICS.

[1] https://lkml.kernel.org/r/CAFpQJXUq-JuEP=QPidy4p_=FN0rkH5Z-kfB4qBvsf6jMS87Edg@xxxxxxxxxxxxxx
[2] https://lkml.kernel.org/r/7c459f26-13a6-a817-e508-b65b903a8378@xxxxxxx


This patch (of 4):

Since commit c33d6c06f60f ("mm, page_alloc: avoid looking up the first
zone in a zonelist twice") we have a wrong check for NULL preferred_zone,
which can theoretically happen due to concurrent cpuset modification.  We
check the zoneref pointer, which is never NULL, when we should check the
zone pointer.  Also document this in the first_zones_zonelist() comment,
per Michal Hocko.
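
To illustrate why the old check could never fire, here is a minimal
userspace sketch.  It is not the kernel code: the struct layouts are
simplified stand-ins, the nodemask is modelled as a plain bitmask, and
first_zones_zonelist_sketch(), no_zone_old_check(), no_zone_new_check() and
sketch_detects_empty() are hypothetical names used only for this example.
The one property it relies on is that the zoneref array ends with an entry
whose zone pointer is NULL.

```c
#include <stddef.h>

/* Simplified stand-ins for the kernel types; NOT the real definitions. */
struct zone { int node; };
struct zoneref { struct zone *zone; int zone_idx; };

#define SKETCH_MAX_ZONEREFS 2
/* The array is terminated by an entry whose zone pointer is NULL. */
struct zonelist { struct zoneref _zonerefs[SKETCH_MAX_ZONEREFS + 1]; };

/* Hypothetical model of the lookup; 'nodes' is a plain bitmask of nodes. */
struct zoneref *first_zones_zonelist_sketch(struct zonelist *zl,
					    int highest_zoneidx,
					    unsigned long nodes)
{
	struct zoneref *z = zl->_zonerefs;

	/* Skip entries above the zone index or outside the allowed nodes. */
	while (z->zone && (z->zone_idx > highest_zoneidx ||
			   !(nodes & (1UL << z->zone->node))))
		z++;
	return z;	/* never NULL; may point at the terminator entry */
}

/* Check as written before this patch: tests the cursor itself. */
int no_zone_old_check(struct zoneref *z) { return !z; }

/* Check as written after this patch: tests the zone the cursor refers to. */
int no_zone_new_check(struct zoneref *z) { return !z->zone; }

/* Look up with an empty node bitmask, so that no zone is eligible. */
int sketch_detects_empty(int use_new_check)
{
	struct zone z0 = { 0 }, z1 = { 1 };
	struct zonelist zl = { { { &z0, 1 }, { &z1, 1 }, { NULL, 0 } } };
	struct zoneref *z = first_zones_zonelist_sketch(&zl, 1, 0UL);

	return use_new_check ? no_zone_new_check(z) : no_zone_old_check(z);
}
```

In this sketch the lookup with an empty bitmask returns a pointer to the
terminator entry, so only the check on z->zone notices that no zone was
found.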

Fixes: c33d6c06f60f ("mm, page_alloc: avoid looking up the first zone in a zonelist twice")
Link: http://lkml.kernel.org/r/20170120103843.24587-2-vbabka@xxxxxxx
Signed-off-by: Vlastimil Babka <vbabka@xxxxxxx>
Acked-by: Mel Gorman <mgorman@xxxxxxxxxxxxxxxxxxx>
Cc: Ganapatrao Kulkarni <gpkulkarni@xxxxxxxxx>
Cc: Michal Hocko <mhocko@xxxxxxxx>
Cc: <stable@xxxxxxxxxxxxxxx>
Signed-off-by: Andrew Morton <akpm@xxxxxxxxxxxxxxxxxxxx>
---

 include/linux/mmzone.h |    6 +++++-
 mm/page_alloc.c        |    2 +-
 2 files changed, 6 insertions(+), 2 deletions(-)

diff -puN include/linux/mmzone.h~mm-page_alloc-fix-check-for-null-preferred_zone include/linux/mmzone.h
--- a/include/linux/mmzone.h~mm-page_alloc-fix-check-for-null-preferred_zone
+++ a/include/linux/mmzone.h
@@ -972,12 +972,16 @@ static __always_inline struct zoneref *n
  * @zonelist - The zonelist to search for a suitable zone
  * @highest_zoneidx - The zone index of the highest zone to return
  * @nodes - An optional nodemask to filter the zonelist with
- * @zone - The first suitable zone found is returned via this parameter
+ * @return - Zoneref pointer for the first suitable zone found (see below)
  *
  * This function returns the first zone at or below a given zone index that is
  * within the allowed nodemask. The zoneref returned is a cursor that can be
  * used to iterate the zonelist with next_zones_zonelist by advancing it by
  * one before calling.
+ *
+ * When no eligible zone is found, zoneref->zone is NULL (zoneref itself is
+ * never NULL). This may happen either genuinely, or due to concurrent nodemask
+ * update due to cpuset modification.
  */
 static inline struct zoneref *first_zones_zonelist(struct zonelist *zonelist,
 					enum zone_type highest_zoneidx,
diff -puN mm/page_alloc.c~mm-page_alloc-fix-check-for-null-preferred_zone mm/page_alloc.c
--- a/mm/page_alloc.c~mm-page_alloc-fix-check-for-null-preferred_zone
+++ a/mm/page_alloc.c
@@ -3784,7 +3784,7 @@ retry_cpuset:
 	 */
 	ac.preferred_zoneref = first_zones_zonelist(ac.zonelist,
 					ac.high_zoneidx, ac.nodemask);
-	if (!ac.preferred_zoneref) {
+	if (!ac.preferred_zoneref->zone) {
 		page = NULL;
 		goto no_zone;
 	}
_

Patches currently in -mm which might be from vbabka@xxxxxxx are

mm-mempolicyc-do-not-put-mempolicy-before-using-its-nodemask.patch
mm-page_alloc-fix-check-for-null-preferred_zone.patch
mm-page_alloc-fix-fast-path-race-with-cpuset-update-or-removal.patch
mm-page_alloc-move-cpuset-seqcount-checking-to-slowpath.patch
mm-page_alloc-fix-premature-oom-when-racing-with-cpuset-mems-update.patch
mm-page_alloc-dont-convert-pfn-to-idx-when-merging.patch
mm-page_alloc-avoid-page_to_pfn-when-merging-buddies.patch
