Re: oom: How to handle !__GFP_FS exception?

David Rientjes <rientjes@xxxxxxxxxx> · Tue, 9 Jun 2015 15:41:54 -0700 (PDT)

On Tue, 9 Jun 2015, Tetsuo Handa wrote:

> > The !__GFP_FS exception is historic because the oom killer would trigger 
> > waaay too often if it were removed simply because it doesn't have a great 
> > chance of allowing reclaim to succeed.  Instead, we rely on external 
> > memory freeing or other parallel allocators being able to reclaim memory.  
> > What happens when there is no external freeing, nothing else is trying to 
> > reclaim, or nothing else is able to reclaim?  Yeah, that's the big 
> > problem.  In my opinion, there's three ways of attacking it: (1) 
> > preallocation so we are less dependent on the page allocator in these 
> > contexts, (2) memory reserves in extraordinary circumstances to allow 
> > forward progress (it's already tunable by min_free_kbytes), and (3) 
> > eventual page allocation failure when neither of these succeed.
> > 
> According to my observations (as posted at
> http://marc.info/?l=linux-mm&m=143239200805478 ), (3) is dangerous because
> it can potentially kill critical processes including global init process.
> Killing a process by invoking the OOM killer sounds safer than (3).
> 

Wow, that's a long changelog :)  I tried my best to look through it and 
http://marc.info/?l=linux-kernel&m=142676304911566 to find where init 
could possibly be killed and it references being killed by SIGBUS at 
pagefault?  I'm not sure how that could be possible since mm_fault_error() 
should be handling VM_FAULT_OOM if any page allocation returns NULL in the 
page fault path.  If that's not being set appropriately (VM_FAULT_OOM on 
page allocation failure), are there stack traces that indicate where that 
might be?  Perhaps this was testing of a patch that was not upstream?

Being killed by SIGBUS certainly should not be the result of the page 
allocator returning NULL, but perhaps I'm missing some failure path that 
never happens because the allocator infinite loop never returns NULL 
today.  Trying option (3), in combination with the others, will 
undoubtedly yield some breakage because of bad failure handling that 
hasn't been exercised before, but this one seems preventable.

> Regarding (2), how can we selectively allow blocking process to access
> memory reserves? Since we don't know the dependency, we can't identify the
> process which should be allowed to access memory reserves. If we allow all
> processes to access memory reserves, unrelated processes could deplete the
> memory reserves while the blocking process is waiting for a lock (either in
> killable or unkillable state). What we need to do to make forward progress
> is not always to allow access to memory reserves. Sometimes making locks
> killable (as posted at http://marc.info/?l=linux-mm&m=142408937117294 )
> helps more.
> 

Yeah, I'm all too familiar with this scenario in the memcg world 
unfortunately.  The only solution that I've come up with, and implemented 
for our kernels to test the theory, is to allow access to memory reserves 
(or for memcg, overcharge) if an allocation continually loops due to the 
oom killer being deferred as a result of a pending oom victim.  Basically, 
my patch causes out_of_memory() to return a pointer to the task_struct of 
the process that we're waiting to exit and the page allocator continually 
checks to ensure it is the same and then when a configurable threshold is 
reached, it gives access to memory reserves.  The thread holding the mutex 
that the oom victim wants will eventually allocate due to this and 
hopefully make forward progress.  The system grinds to a halt if you're 
too conservative in this approach with regards to detecting the infinite 
oom killer deferral.  (How many iterations do you consider to be stall?  
Do you set ALLOC_HIGH | ALLOC_HARDER?  Do you set ALLOC_NO_WATERMARKS?)

> Regarding (1), it would help but insufficient because (2) and (3) unlikely
> work.
> 

Option (1) is somewhat independent of the others and fixable if we find 
situations where memory allocation can be done prior to holding a 
potentially contended mutex.  We hope that nobody is needlessly holding a 
contended mutex while allocating, and that seems to be the case most 
often.  However, there may still be situations where it happens.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@xxxxxxxxx.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@xxxxxxxxx";> email@xxxxxxxxx </a>