RFC for small allocation failure mode transition plan (was: Re: [Lsf] common session about page allocator vs. FS/IO) It's time to put together the schedule)

Michal Hocko <mhocko@xxxxxxx> · Sun, 8 Mar 2015 14:11:29 -0400

On Mon 02-03-15 07:44:54, James Bottomley wrote:
> On Mon, 2015-03-02 at 10:41 -0500, Jeff Layton wrote:
> > On Mon, 2 Mar 2015 16:28:58 +0100
> > Michal Hocko <mhocko@xxxxxxx> wrote:
> > 
> > > [Let's add people from the discussion on the CC]
> > > 
> > > On Mon 02-03-15 07:26:33, James Bottomley wrote:
> > > > On Mon, 2015-03-02 at 16:19 +0100, Michal Hocko wrote:
> > > > > On Mon 23-02-15 18:08:42, Michal Hocko wrote:
[...]
> > > > > > I would like to propose a common session (FS and MM, maybe IO as well)
> > > > > > about memory allocator guarantees and the current behavior of the
> > > > > > page allocator with different gfp flags - GFP_KERNEL being basically
> > > > > > __GFP_NOFAIL for small allocations. __GFP_NOFAIL allocations in general
> > > > > > - how they are used in fs/io code paths and what can the allocator do
> > > > > > to prevent from memory exhaustion. GFP_NOFS behavior when combined with
> > > > > > __GFP_NOFAIL and so on. It seems there was a disconnection between mm
> > > > > > and fs people and one camp is not fully aware of what others are doing
> > > > > > and why as it turned out during recent discussions.
> > > > > 
> > > > > James do you have any plans to put this on the schedule?
> > > > 
> > > > I was waiting to see if there was any other feedback, but if you feel
> > > > strongly it should happen, I can do it.
> > > 
> > > I think it would be helpful, but let's see what other involved in the
> > > discussion think.
> > 
> > It makes sense to me as a plenary discussion.
> > 
> > I was personally quite surprised to hear that small allocations
> > couldn't fail, and dismayed at how much time I've spent writing dead
> > error handling code. ;)
> > 
> > If we're keen to get rid of that behavior (and I think it really ought
> > to go, IMNSHO), then what might make sense is to add a Kconfig switch
> > that allows small allocations to fail as an interim step and see what
> > breaks when it's enabled.
> > 
> > Once we fix all of those places up, then we can see about getting
> > distros to turn it on, and eventually eliminate the Kconfig switch
> > altogether. It'll take a few years, but that's probably the least
> > disruptive approach.
> 
> OK, your wish is my command: it's filled up the last empty plenary slot
> on Monday morning.

I guess the following RFC patch should be good for the first part of the
topic - Small allocations implying __GFP_NOFAIL currently. I am CCing
linux-mm mailing list as well so that people not attending LSF/MM can
comment on the approach.

I hope people will find time to look at it before the session because I
am afraid two topics per one slot will be too dense otherwise. I also
hope this part will be less controversial and the primary point for
discussion will be on HOW TO GET RID OF the current behavior in a sane
way rather than WHY TO KEEP IT.
---
>From cc3488d68d0c7839a92c9772a1420bb666f00c35 Mon Sep 17 00:00:00 2001
From: Michal Hocko <mhocko@xxxxxxx>
Date: Sun, 8 Mar 2015 07:37:36 -0400
Subject: [PATCH] mm: Allow small allocations to fail

It's been ages since small allocations basically imply __GFP_NOFAIL
behavior (with some nuance - e.g. allocation might fail if the current
task is an OOM victim). The idea at the time was that the OOM killer
will make sufficient progress to allow to make progress and so the
small allocation will succeed in the end. This assumption is flawed,
though, because the retrying allocation might be blocking a resource
(e.g. a lock) which might prevent the OOM killer victim from making
progress and so the system is basically deadlocked.

Another aspect is that this behavior makes it extremely hard to make
any allocation failure policy implementation at the allocation caller
level impossible. Most of the allocation paths already check the for the
allocation failure and handle it properly.

There are some places which BUG_ON failure (mostly early boot code) and
they do not have to be changed. It is better to see a panic rather than
a silent freeze of the machine.

Finally, if a non-failing allocation is unavoidable then __GFP_NOFAIL
flag should be used.

As this behavior is established for many years we cannot change it
immediately. This patch instead exports a new sysctl/proc knob which
tells allocator how much to retry. The higher the number the longer will
the allocator loop and try to trigger OOM killer when the memory is too
low. This implementation counts only those retries which involved OOM
killer because we do not want to be too eager to fail the request.

The default value is -1 which preserves the current behavior (endless
retries). The idea is that we start with testing systems first and
lower the value to catch potential fallouts (crashes due to unchecked
failures or other misbehavior like FS ro-remounts etc...).  We can try
to encourage distributions to change the default in the second step
so that we get a much bigger exposure. And finally we can change the
default in the kernel while still keeping the knob for conservative
configurations.

Signed-off-by: Michal Hocko <mhocko@xxxxxxx>
---
 Documentation/sysctl/vm.txt | 24 ++++++++++++++++++++++++
 include/linux/mm.h          |  2 ++
 kernel/sysctl.c             |  7 +++++++
 mm/page_alloc.c             | 28 ++++++++++++++++++++++------
 4 files changed, 55 insertions(+), 6 deletions(-)

diff --git a/Documentation/sysctl/vm.txt b/Documentation/sysctl/vm.txt
index e9c706e4627a..bd4b4dd87beb 100644
--- a/Documentation/sysctl/vm.txt
+++ b/Documentation/sysctl/vm.txt
@@ -53,6 +53,7 @@ Currently, these files are in /proc/sys/vm:
 - page-cluster
 - panic_on_oom
 - percpu_pagelist_fraction
+- retry_allocation_attempts
 - stat_interval
 - swappiness
 - user_reserve_kbytes
@@ -707,6 +708,29 @@ sysctl, it will revert to this default behavior.
 
 ==============================================================
 
+retry_allocation_attempts
+
+Page allocator tries hard to not fail small allocations requests.
+Currently it retries indefinitely for small allocations requests (<= 32kB).
+This works mostly fine but under an extreme low memory conditions system
+might end up in deadlock situations because the looping allocation
+request might block further progress for OOM killer victims.
+
+Even though this hasn't turned out to be a huge problem for many years the
+long term plan is to move away from this default behavior but as this is
+a long established behavior we cannot change it immediately.
+
+This knob should help in the transition. It tells how many times should
+allocator retry when the system is OOM before the allocation fails.
+The default value (-1) preserves the old behavior. This is a safe
+default for production systems which cannot afford any unexpected
+downtimes. More experimental systems might set it to a small number
+(e.g. 10). A higher number would make allocation failures less probable
+and still allow for OOM killer to make a progress (the locked up state
+would take longer than with the lower values though).
+
+==============================================================
+
 stat_interval
 
 The time interval between which vm statistics are updated.  The default
diff --git a/include/linux/mm.h b/include/linux/mm.h
index b720b5146a4e..e3b42f46e743 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -75,6 +75,8 @@ extern int sysctl_overcommit_memory;
 extern int sysctl_overcommit_ratio;
 extern unsigned long sysctl_overcommit_kbytes;
 
+extern unsigned long sysctl_nr_alloc_retry;
+
 extern int overcommit_ratio_handler(struct ctl_table *, int, void __user *,
 				    size_t *, loff_t *);
 extern int overcommit_kbytes_handler(struct ctl_table *, int, void __user *,
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index 88ea2d6e0031..25c7ed1265ac 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -1499,6 +1499,13 @@ static struct ctl_table vm_table[] = {
 		.mode		= 0644,
 		.proc_handler	= proc_doulongvec_minmax,
 	},
+	{
+		.procname	= "retry_allocation_attempts",
+		.data		= &sysctl_nr_alloc_retry,
+		.maxlen		= sizeof(sysctl_nr_retry),
+		.mode		= 0644,
+		.proc_handler	= proc_doulongvec_minmax,
+	},
 	{ }
 };
 
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 58f6cf5bdde2..2410d4475fe8 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -123,6 +123,17 @@ unsigned long dirty_balance_reserve __read_mostly;
 int percpu_pagelist_fraction;
 gfp_t gfp_allowed_mask __read_mostly = GFP_BOOT_MASK;
 
+/*
+ * Number of allocation retries after the system is considered OOM.
+ * We have been retrying indefinitely for low order allocations for
+ * a very long time and this sysctl should help us to move away from
+ * this behavior because it complicates low memory conditions handling.
+ * The current default is preserving the behavior but non-critical
+ * environments are encouraged to lower the value to catch potential
+ * issues which should be fixed.
+ */
+unsigned long sysctl_nr_alloc_retry = -1UL;
+
 #ifdef CONFIG_PM_SLEEP
 /*
  * The following functions are used by the suspend/hibernate code to temporarily
@@ -2322,7 +2333,8 @@ void warn_alloc_failed(gfp_t gfp_mask, int order, const char *fmt, ...)
 static inline int
 should_alloc_retry(gfp_t gfp_mask, unsigned int order,
 				unsigned long did_some_progress,
-				unsigned long pages_reclaimed)
+				unsigned long pages_reclaimed,
+				unsigned long nr_retries)
 {
 	/* Do not loop if specifically requested */
 	if (gfp_mask & __GFP_NORETRY)
@@ -2342,11 +2354,12 @@ should_alloc_retry(gfp_t gfp_mask, unsigned int order,
 
 	/*
 	 * In this implementation, order <= PAGE_ALLOC_COSTLY_ORDER
-	 * means __GFP_NOFAIL, but that may not be true in other
-	 * implementations.
+	 * retries allocations as per global configuration which might
+	 * also be indefinitely.
 	 */
-	if (order <= PAGE_ALLOC_COSTLY_ORDER)
-		return 1;
+	if (order <= PAGE_ALLOC_COSTLY_ORDER &&
+			nr_retries < sysctl_nr_alloc_retry)
+			return 1;
 
 	/*
 	 * For order > PAGE_ALLOC_COSTLY_ORDER, if __GFP_REPEAT is
@@ -2651,6 +2664,7 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 	enum migrate_mode migration_mode = MIGRATE_ASYNC;
 	bool deferred_compaction = false;
 	int contended_compaction = COMPACT_CONTENDED_NONE;
+	unsigned long nr_retries = 0;
 
 	/*
 	 * In the slowpath, we sanity check order to avoid ever trying to
@@ -2794,7 +2808,7 @@ retry:
 	/* Check if we should retry the allocation */
 	pages_reclaimed += did_some_progress;
 	if (should_alloc_retry(gfp_mask, order, did_some_progress,
-						pages_reclaimed)) {
+			       pages_reclaimed, nr_retries)) {
 		/*
 		 * If we fail to make progress by freeing individual
 		 * pages, but the allocation wants us to keep going,
@@ -2807,6 +2821,8 @@ retry:
 				goto got_pg;
 			if (!did_some_progress)
 				goto nopage;
+
+			nr_retries++;
 		}
 		/* Wait for some write requests to complete then retry */
 		wait_iff_congested(ac->preferred_zone, BLK_RW_ASYNC, HZ/50);
-- 
2.1.4


-- 
Michal Hocko
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@xxxxxxxxx.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@xxxxxxxxx";> email@xxxxxxxxx </a>