[PATCH] mm,page_alloc: Allow !__GFP_FS allocations to invoke the OOM killer

Tetsuo Handa <penguin-kernel@xxxxxxxxxxxxxxxxxxx> · Fri, 23 Sep 2016 00:22:57 +0900

Historically, we did not invoke the OOM killer for small !__GFP_FS memory
allocation requests that failed to make progress. Instead of failing such
requests, we keep them retrying inside the page allocator by telling a lie
that some progress was made. This behavior can lead to a silent OOM
livelock in which nobody can invoke the OOM killer. However, making such
requests fail is inadvisable because it led to a significant loss of
reliability, as shown by
http://lkml.kernel.org/r/201502202020.BGG05734.FJOSMLtHFQOFOV@xxxxxxxxxxxxxxxxxxx
and worked around by commit cc87317726f85153 ("mm: page_alloc: revert
inadvertent !__GFP_FS retry behavior change").
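
To make the "lie" concrete, here is a minimal userspace sketch, not kernel
code. The names sketch_alloc(), struct sketch_result and the may_kill
parameter are hypothetical, and the real retry logic in the page allocator
is considerably more involved. The sketch only illustrates that a loop
which always reports progress, but never frees anything, cannot finish on
its own.

```c
#include <stdbool.h>

/* Hypothetical result record for this sketch; not a kernel structure. */
struct sketch_result {
	int attempts;		/* how many times the loop ran */
	bool allocated;		/* did the request eventually succeed? */
	bool victim_killed;	/* was an OOM victim selected? */
};

/*
 * Toy model of an allocation that finds no free memory.
 * may_kill == false models the case where the OOM killer is skipped
 * but progress is still reported, so the caller keeps retrying.
 * max_attempts only exists to keep the sketch from looping forever.
 */
static struct sketch_result sketch_alloc(bool may_kill, int max_attempts)
{
	struct sketch_result r = { 0, false, false };
	int free_pages = 0;	/* reclaim is assumed to free nothing */

	while (r.attempts < max_attempts) {
		r.attempts++;
		if (free_pages > 0) {
			r.allocated = true;
			break;
		}
		if (may_kill && !r.victim_killed) {
			/* Killing a victim is the only source of memory here. */
			r.victim_killed = true;
			free_pages = 1;
		}
		/* In both cases "some progress" is reported and we retry. */
	}
	return r;
}
```

With may_kill set to false the loop only ends when it hits max_attempts,
which is the bounded stand-in for a livelock. With may_kill set to true
the second attempt succeeds.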

We can stop telling the lie by explicitly specifying either __GFP_NOFAIL
or __GFP_NORETRY for !__GFP_FS memory allocation requests. We actually
added __GFP_NOFAIL to some such allocation requests, as shown by
http://lkml.kernel.org/r/1438768284-30927-1-git-send-email-mhocko@xxxxxxxxxx ,
but I did not agree with making !__GFP_NOFAIL !__GFP_FS memory allocation
requests fail.



I think it is too heavy-handed to require every !__GFP_FS memory
allocation request that can return an error code to userspace processes
to specify either __GFP_NOFAIL or __GFP_NORETRY. Making !__GFP_NOFAIL
!__GFP_FS memory allocation requests fail by default is equally
heavy-handed.

In Linux, I think that the OOM killer and oom_score_adj govern behavior
when memory is exhausted, and specifying __GFP_NORETRY goes against the
expectations set through oom_score_adj. In most cases, a userspace process
terminates upon an unexpected ENOMEM error. That outcome is not so
different from killing some userspace process by invoking the OOM killer.
Returning ENOMEM to a userspace process merely because an !__GFP_FS
allocation request failed is not desirable. Whether the allocation request
that caused the ENOMEM error was __GFP_FS or not is a matter of the
kernel's convenience, not a problem of userspace processes. It is annoying
when an OOM-unkillable userspace process unexpectedly terminates instead
of the OOM killer killing the process that is most suitable for being
OOM-killed.

Also, for !__GFP_FS memory allocation requests that cannot return an error
code to userspace, it is too late to recover once such a request fails.
It is sad when delayed writes (buffered I/O) are lost simply for the
convenience of the kernel's memory management. Asking userspace processes
to use fsync() (or to avoid delayed writes) as self-defense against
system-wide OOM events would cost significant performance.

Therefore, for userspace processes, allowing !__GFP_FS memory allocation
requests to invoke the OOM killer is the least painful approach.



Since most memory allocation requests include __GFP_KSWAPD_RECLAIM,
kswapd will be woken up and will do __GFP_FS reclaim in the background.
Thus, we can effectively assume that somebody is doing a __GFP_FS memory
allocation request as long as !__GFP_FS memory allocation requests are
looping inside the page allocator. However, this assumption holds only if
somebody can invoke the OOM killer when nobody can reclaim memory.

__GFP_FS memory allocation requests might wait for !__GFP_FS memory
allocation requests. For example, memory allocation requests can be
blocked at too_many_isolated(), called from shrink_inactive_list(), while
kswapd is blocked on fs locks waiting for fs writeback. Since the
too_many_isolated() threshold differs between __GFP_FS and !__GFP_FS
memory allocation requests, it is possible that only !__GFP_FS requests
can reach __alloc_pages_may_oom() whereas __GFP_FS requests are blocked at
too_many_isolated(). Therefore, the value of __GFP_FS's ability to invoke
the OOM killer is lost unless !__GFP_FS memory allocation requests are
guaranteed to be able to make forward progress. As explained above,
failing !__GFP_FS memory allocation requests is annoying for userspace
processes.
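
The threshold difference can be sketched in userspace as follows. This is
an illustration under assumptions, not the kernel function: the ISO_GFP_*
bit values and iso_too_many_isolated() are hypothetical, and the divisor
of 8 for callers that carry both the IO and FS bits is my assumption
about how the stricter limit is expressed.

```c
#include <stdbool.h>

/* Hypothetical flag bits for this sketch; not the kernel's values. */
#define ISO_GFP_IO (1u << 0)
#define ISO_GFP_FS (1u << 1)

/*
 * Toy model of a reclaim throttle. Callers that may do both IO and FS
 * work get a tighter limit (assumed here to be inactive / 8), so they
 * are throttled earlier than callers missing one of the two bits.
 */
static bool iso_too_many_isolated(unsigned int gfp_mask,
				  unsigned long inactive,
				  unsigned long isolated)
{
	const unsigned int full = ISO_GFP_IO | ISO_GFP_FS;

	if ((gfp_mask & full) == full)
		inactive >>= 3;

	return isolated > inactive;
}
```

With 800 inactive and 200 isolated pages, the caller with both bits set is
throttled (200 > 100) while the caller without the FS bit is not
(200 <= 800). In that situation only the latter can go on to reach the
OOM path.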

If I understand http://lkml.kernel.org/r/20150812091104.GA14940@xxxxxxxxxxxxxx
correctly, the reason not to invoke the OOM killer is that we currently
have no way to determine whether somebody else can make progress via
__GFP_FS reclaim. But regarding the OOM killer/reaper, we are eliminating
locations which may fall into OOM livelock (e.g.
http://lkml.kernel.org/r/201602171930.AII18204.FMOSVFQFOJtLOH@xxxxxxxxxxxxxxxxxxx ).
As a result, the __GFP_FS check in out_of_memory() is the last location
which may fall into OOM livelock after out_of_memory() is called.
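
For reference, the effect of that check can be modeled in userspace as
below. The OOM_GFP_* bit values and oom_skips_victim_selection() are
hypothetical names for this sketch only; the condition itself is the one
in the hunk this patch deletes.

```c
#include <stdbool.h>

/* Hypothetical flag bits for this sketch; not the kernel's values. */
#define OOM_GFP_FS		(1u << 0)
#define OOM_GFP_NOFAIL		(1u << 1)
#define OOM_GFP_DIRECT_RECLAIM	(1u << 2)

/*
 * Returns true when out_of_memory() would return early, reporting
 * success without selecting a victim. patched == true models the
 * function after the check is removed.
 */
static bool oom_skips_victim_selection(unsigned int gfp_mask, bool patched)
{
	if (patched)
		return false;

	/* A zero mask (page fault path) is never skipped. */
	return gfp_mask && !(gfp_mask & (OOM_GFP_FS | OOM_GFP_NOFAIL));
}
```

Before the patch, a mask carrying only the direct reclaim bit (a
GFP_NOFS-like request in this model) is the one that skips victim
selection. After the patch, no mask does.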



It is not clean that __GFP_FS doubles as permission to invoke the OOM
killer when __GFP_NOFAIL is not included. I think that __GFP_FS should be
independent of whether the OOM killer is invoked. Regarding behavior after
the OOM killer is invoked, we can now guarantee forward progress (though
only on CONFIG_MMU=y kernels). Thus, if we allow !__GFP_FS memory
allocation requests to invoke the OOM killer, we can guarantee forward
progress and eliminate the possibility of silent OOM livelock (again,
only on CONFIG_MMU=y kernels).

As a first step, I do want to eliminate the possibility of silent OOM
livelock. If this patch causes !__GFP_FS memory allocation requests to
invoke the OOM killer too easily, we will at least be able to emit warning
messages periodically for as long as we are telling the lie instead of
invoking the OOM killer. Without knowing which caller is falling into OOM
livelock, we will remain too cowardly to decide when we can stop telling
the lie, and we will keep bothering administrators with silent OOM
livelock.

Signed-off-by: Tetsuo Handa <penguin-kernel@xxxxxxxxxxxxxxxxxxx>
Cc: Michal Hocko <mhocko@xxxxxxx>
Cc: David Rientjes <rientjes@xxxxxxxxxx>
Cc: Johannes Weiner <hannes@xxxxxxxxxxx>
---
 mm/oom_kill.c | 9 ---------
 1 file changed, 9 deletions(-)

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index f284e92..7893c5c 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -1005,15 +1005,6 @@ bool out_of_memory(struct oom_control *oc)
 	}
 
 	/*
-	 * The OOM killer does not compensate for IO-less reclaim.
-	 * pagefault_out_of_memory lost its gfp context so we have to
-	 * make sure exclude 0 mask - all other users should have at least
-	 * ___GFP_DIRECT_RECLAIM to get here.
-	 */
-	if (oc->gfp_mask && !(oc->gfp_mask & (__GFP_FS|__GFP_NOFAIL)))
-		return true;
-
-	/*
 	 * Check if there were limitations on the allocation (only relevant for
 	 * NUMA and memcg) that may require different handling.
 	 */
-- 
1.8.3.1
