+ mm-hugetlb-allow-hugepage-allocations-to-excessively-reclaim.patch added to -mm tree

akpm@xxxxxxxxxxxxxxxxxxxx · Mon, 07 Oct 2019 15:03:38 -0700

The patch titled
     Subject: mm, hugetlb: allow hugepage allocations to excessively reclaim
has been added to the -mm tree.  Its filename is
     mm-hugetlb-allow-hugepage-allocations-to-excessively-reclaim.patch

This patch should soon appear at
    http://ozlabs.org/~akpm/mmots/broken-out/mm-hugetlb-allow-hugepage-allocations-to-excessively-reclaim.patch
and later at
    http://ozlabs.org/~akpm/mmotm/broken-out/mm-hugetlb-allow-hugepage-allocations-to-excessively-reclaim.patch

Before you just go and hit "reply", please:
   a) Consider who else should be cc'ed
   b) Prefer to cc a suitable mailing list as well
   c) Ideally: find the original patch on the mailing list and do a
      reply-to-all to that, adding suitable additional cc's

*** Remember to use Documentation/process/submit-checklist.rst when testing your code ***

The -mm tree is included into linux-next and is updated
there every 3-4 working days

------------------------------------------------------
From: David Rientjes <rientjes@xxxxxxxxxx>
Subject: mm, hugetlb: allow hugepage allocations to excessively reclaim

b39d0ee2632d ("mm, page_alloc: avoid expensive reclaim when compaction may
not succeed") has chnaged the allocator to bail out from the allocator
early to prevent from a potentially excessive memory reclaim. 
__GFP_RETRY_MAYFAIL is designed to retry the allocation, reclaim and
compaction loop as long as there is a reasonable chance to make forward
progress.  Neither COMPACT_SKIPPED nor COMPACT_DEFERRED at the
INIT_COMPACT_PRIORITY compaction attempt gives this feedback.

The most obvious affected subsystem is hugetlbfs which allocates huge
pages based on an admin request (or via admin configured overcommit).  I
have done a simple test which tries to allocate half of the memory for
hugetlb pages while the memory is full of a clean page cache.  This is not
an unusual situation because we try to cache as much of the memory as
possible and sysctl/sysfs interface to allocate huge pages is there for
flexibility to allocate hugetlb pages at any time.

System has 1GB of RAM and we are requesting 515MB worth of hugetlb pages
after the memory is prefilled by a clean page cache:
root@test1:~# cat hugetlb_test.sh

set -x
echo 0 > /proc/sys/vm/nr_hugepages
echo 3 > /proc/sys/vm/drop_caches
echo 1 > /proc/sys/vm/compact_memory
dd if=/mnt/data/file-1G of=/dev/null bs=$((4<<10))
TS=$(date +%s)
echo 256 > /proc/sys/vm/nr_hugepages
cat /proc/sys/vm/nr_hugepages

The results for 2 consecutive runs on clean 5.3
root@test1:~# sh hugetlb_test.sh
+ echo 0
+ echo 3
+ echo 1
+ dd if=/mnt/data/file-1G of=/dev/null bs=4096
262144+0 records in
262144+0 records out
1073741824 bytes (1.1 GB) copied, 21.0694 s, 51.0 MB/s
+ date +%s
+ TS=1569905284
+ echo 256
+ cat /proc/sys/vm/nr_hugepages
256
root@test1:~# sh hugetlb_test.sh
+ echo 0
+ echo 3
+ echo 1
+ dd if=/mnt/data/file-1G of=/dev/null bs=4096
262144+0 records in
262144+0 records out
1073741824 bytes (1.1 GB) copied, 21.7548 s, 49.4 MB/s
+ date +%s
+ TS=1569905311
+ echo 256
+ cat /proc/sys/vm/nr_hugepages
256

Now with b39d0ee2632d applied
root@test1:~# sh hugetlb_test.sh
+ echo 0
+ echo 3
+ echo 1
+ dd if=/mnt/data/file-1G of=/dev/null bs=4096
262144+0 records in
262144+0 records out
1073741824 bytes (1.1 GB) copied, 20.1815 s, 53.2 MB/s
+ date +%s
+ TS=1569905516
+ echo 256
+ cat /proc/sys/vm/nr_hugepages
11
root@test1:~# sh hugetlb_test.sh
+ echo 0
+ echo 3
+ echo 1
+ dd if=/mnt/data/file-1G of=/dev/null bs=4096
262144+0 records in
262144+0 records out
1073741824 bytes (1.1 GB) copied, 21.9485 s, 48.9 MB/s
+ date +%s
+ TS=1569905541
+ echo 256
+ cat /proc/sys/vm/nr_hugepages
12

The success rate went down by factor of 20!

Although hugetlb allocation requests might fail and it is reasonable to
expect them to under extremely fragmented memory or when the memory is
under a heavy pressure but the above situation is not that case.

Fix the regression by reverting back to the previous behavior for
__GFP_RETRY_MAYFAIL requests and disable the beail out heuristic for those
requests.

Mike said:

: hugetlbfs allocations are commonly done via sysctl/sysfs shortly after
: boot where this may not be as much of an issue.  However, I am aware of at
: least three use cases where allocations are made after the system has been
: up and running for quite some time:
: 
: - DB reconfiguration.  If sysctl/sysfs fails to get required number of
:   huge pages, system is rebooted to perform allocation after boot.
: 
: - VM provisioning.  If unable get required number of huge pages, fall
:   back to base pages.
: 
: - An application that does not preallocate pool, but rather allocates
:   pages at fault time for optimal NUMA locality.
: 
: In all cases, I would expect b39d0ee2632d to cause regressions and
: noticable behavior changes.
:
: My quick/limited testing in
: https://lkml.kernel.org/r/3468b605-a3a9-6978-9699-57c52a90bd7e@xxxxxxxxxx
: was insufficient.  It was also mentioned that if something like
: b39d0ee2632d went forward, I would like exemptions for __GFP_RETRY_MAYFAIL
: requests as in this patch.

[mhocko@xxxxxxxx: reworded changelog]
Link: http://lkml.kernel.org/r/20191007075548.12456-1-mhocko@xxxxxxxxxx
Fixes: b39d0ee2632d ("mm, page_alloc: avoid expensive reclaim when compaction may not succeed")
Signed-off-by: David Rientjes <rientjes@xxxxxxxxxx>
Signed-off-by: Michal Hocko <mhocko@xxxxxxxx>
Reviewed-by: Mike Kravetz <mike.kravetz@xxxxxxxxxx>
Cc: Linus Torvalds <torvalds@xxxxxxxxxxxxxxxxxxxx>
Cc: Vlastimil Babka <vbabka@xxxxxxx>
Cc: Mel Gorman <mgorman@xxxxxxx>
Signed-off-by: Andrew Morton <akpm@xxxxxxxxxxxxxxxxxxxx>
---

 mm/page_alloc.c |    6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

--- a/mm/page_alloc.c~mm-hugetlb-allow-hugepage-allocations-to-excessively-reclaim
+++ a/mm/page_alloc.c
@@ -4473,12 +4473,14 @@ retry_cpuset:
 		if (page)
 			goto got_pg;
 
-		 if (order >= pageblock_order && (gfp_mask & __GFP_IO)) {
+		 if (order >= pageblock_order && (gfp_mask & __GFP_IO) &&
+		     !(gfp_mask & __GFP_RETRY_MAYFAIL)) {
 			/*
 			 * If allocating entire pageblock(s) and compaction
 			 * failed because all zones are below low watermarks
 			 * or is prohibited because it recently failed at this
-			 * order, fail immediately.
+			 * order, fail immediately unless the allocator has
+			 * requested compaction and reclaim retry.
 			 *
 			 * Reclaim is
 			 *  - potentially very expensive because zones are far
_

Patches currently in -mm which might be from rientjes@xxxxxxxxxx are

mm-hugetlb-allow-hugepage-allocations-to-excessively-reclaim.patch