Re: __GFP_NOFAIL and oom_killer_disabled?

Michal Hocko <mhocko@xxxxxxx> · Mon, 23 Feb 2015 11:21:47 +0100

On Sun 22-02-15 23:48:01, Tetsuo Handa wrote:
> Andrew Morton wrote:
> > And yes, I agree that sites such as xfs's kmem_alloc() should be
> > passing __GFP_NOFAIL to tell the page allocator what's going on.  I
> > don't think it matters a lot whether kmem_alloc() retains its retry
> > loop.  If __GFP_NOFAIL is working correctly then it will never loop
> > anyway...
> 
> __GFP_NOFAIL fails to work correctly if oom_killer_disabled == true.
> I'm wondering how oom_killer_disable() interferes with __GFP_NOFAIL
> allocation. We had race check after setting oom_killer_disabled to true
> in 3.19.
[...]
> I worry that commit c32b3cbe0d067a9c "oom, PM: make OOM detection in
> the freezer path raceless" might have opened a race window for
> __alloc_pages_may_oom(__GFP_NOFAIL) allocation to fail when OOM killer
> is disabled.

This commit hasn't introduced any behavior changes. GFP_NOFAIL
allocations fail when OOM killer is disabled since beginning
7f33d49a2ed5 (mm, PM/Freezer: Disable OOM killer when tasks are frozen).

> I think something like
> 
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -789,7 +789,7 @@ bool out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask,
>  	bool ret = false;
>  
>  	down_read(&oom_sem);
> -	if (!oom_killer_disabled) {
> +	if (!oom_killer_disabled || (gfp_mask & __GFP_NOFAIL)) {
>  		__out_of_memory(zonelist, gfp_mask, order, nodemask, force_kill);
>  		ret = true;
>  	}
> 
> is needed.

> But such change can race with up_write() and wait_event() in
> oom_killer_disable(). 

Not only it races with the above but also breaks the core assumption
that no userspace task might interact with later stages of the suspend.

> While the comment of oom_killer_disable() says
> "The function cannot be called when there are runnable user tasks because
> the userspace would see unexpected allocation failures as a result.",
> aren't there still kernel threads which might do __GFP_NOFAIL allocations?

OK, this is a fair point. My assumption was that kernel threads rarely
do __GFP_NOFAIL allocations. It seems I was wrong here. This makes the
logic much more trickier. I can see 3 possible ways to handle this:

1) move oom_killer_disable after kernel threads are frozen. This has a
   risk that the OOM victim wouldn't be able to finish because it would
   depend on an already frozen kernel thread. This would be really
   tricky to debug.
2) do not fail GFP_NOFAIL allocation no matter what and risk a potential
   (and silent) endless loop during suspend. On the other hand the
   chances that __GFP_NOFAIL comes from a freezable kernel thread rather
   than from deep pm suspend path is considerably higher.
   So now that I am thinking about that it indeed makes more sense to
   simply warn when OOM is disabled and retry the allocation. Freezable
   kernel threads will loop and fail the suspend. Incidental allocations
   after kernel threads are frozen will at least dump a warning - if we
   are lucky and the serial console is still active of course...
3) do nothing ;)

But whatever we do there is simply no way to guarantee __GFP_NOFAIL
after OOM killer has been disabled. So we are risking between endless
loops and possible crashes due to unexpected allocation failures. Not a
nice choice. We can only chose the less risky way and it sounds like 2)
is that option. Considering that we haven't seen any crashes with the
current behavior I would be tempted to simply declare this a corner case
which doesn't need any action but well, I hate to debug nasty issues so
better be prepared...

What about something like the following?
---
>From 35f6eb9b3ae6743cf43e0721c16a52cc8c223fa4 Mon Sep 17 00:00:00 2001
From: Michal Hocko <mhocko@xxxxxxx>
Date: Mon, 23 Feb 2015 10:33:30 +0100
Subject: [PATCH] mm, oom: do not fail __GFP_NOFAaIL allocation if oom killer
 is disbaled

Tetsuo Handa has pointed out that __GFP_NOFAIL allocations might fail
after OOM killer is disabled if the allocation is performed by a
kernel thread. This behavior was introduced from the very beginning by
7f33d49a2ed5 (mm, PM/Freezer: Disable OOM killer when tasks are frozen).
This means that the basic contract for the allocation request is broken
and the context requesting such an allocation might blow up
unexpectedly.

There are basically two ways forward.
1) move oom_killer_disable after kernel threads are frozen. This has a
   risk that the OOM victim wouldn't be able to finish because it would
   depend on an already frozen kernel thread. This would be really
   tricky to debug.
2) do not fail GFP_NOFAIL allocation no matter what and risk a potential
   Freezable kernel threads will loop and fail the suspend. Incidental
   allocations after kernel threads are frozen will at least dump a
   warning - if we are lucky and the serial console is still active of
   course...

This patch implements the later option because it is safer. We would see
warnings rather than allocation failures for the kernel threads which
would blow up otherwise and have a higher chances to identify
__GFP_NOFAIL users from deeper pm code.

Signed-off-by: Michal Hocko <mhocko@xxxxxxx>
---
 mm/oom_kill.c | 8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 642f38cb175a..ea8b443cd871 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -772,6 +772,10 @@ out:
 		schedule_timeout_killable(1);
 }
 
+static DEFINE_RATELIMIT_STATE(oom_disabled_rs,
+		DEFAULT_RATELIMIT_INTERVAL,
+		DEFAULT_RATELIMIT_BURST);
+
 /**
  * out_of_memory -  tries to invoke OOM killer.
  * @zonelist: zonelist pointer
@@ -792,6 +796,10 @@ bool out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask,
 	if (!oom_killer_disabled) {
 		__out_of_memory(zonelist, gfp_mask, order, nodemask, force_kill);
 		ret = true;
+	} else if (gfp_mask & __GFP_NOFAIL) {
+		if (__ratelimit(&oom_disabled_rs))
+			WARN(1, "Unable to make forward progress for __GFP_NOFAIL because OOM killer is disbaled\n");
+		ret = true;
 	}
 	up_read(&oom_sem);
 
-- 
2.1.4



> After all, don't we need to recheck after setting oom_killer_disabled to true?

Recheck what?
-- 
Michal Hocko
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@xxxxxxxxx.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@xxxxxxxxx";> email@xxxxxxxxx </a>