Re: [patch] mm, oom: stop reclaiming if GFP_ATOMIC will start failing soon

David Rientjes <rientjes@xxxxxxxxxx> · Tue, 28 Apr 2020 14:48:25 -0700 (PDT)

On Tue, 28 Apr 2020, Vlastimil Babka wrote:

> > I took a look at doing a quick-fix for the
> > direct-reclaimers-get-their-stuff-stolen issue about a million years
> > ago.  I don't recall where it ended up.  It's pretty trivial for the
> > direct reclaimer to free pages into current->reclaimed_pages and to
> > take a look in there on the allocation path, etc.  But it's only
> > practical for order-0 pages.
> 
> FWIW there's already such approach added to compaction by Mel some time ago,
> so order>0 allocations are covered to some extent. But in this case I imagine
> that compaction won't even start because order-0 watermarks are too low.
> 
> The order-0 reclaim capture might work though - as a result the GFP_ATOMIC
> allocations would more likely fail and defer to their fallback context.
> 

Yes, order-0 reclaim capture is interesting since the issue being reported
here is userspace going out to lunch because it loops for an unbounded
amount of time trying to get above a watermark where it's allowed to
allocate while other consumers are depleting that resource.

We actually prefer to oom kill earlier rather than being put in a
perpetual state of aggressive reclaim that affects all allocators; the
unbounded nature of those allocations leads to very poor results for
everybody.

I'm happy to scope this solely to an order-0 reclaim capture.  Has this
been worked on before, and do patches from those earlier attempts exist?
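
To make the idea concrete, here is a toy userspace model of what order-0
reclaim capture could look like.  It is only a sketch and not kernel code:
every type and function name below is hypothetical, and only the notion of
a per-task reclaimed_pages list comes from the quoted suggestion.  The
watermark check is reduced to a single counter comparison.

```c
/* Toy userspace model, NOT kernel code: all names here are hypothetical. */
#include <assert.h>
#include <stddef.h>

struct page { struct page *next; };

struct zone_model {
	struct page *free_list;   /* shared free pool */
	unsigned long nr_free;    /* pages on free_list */
	unsigned long min_wmark;  /* normal allocs need nr_free > min_wmark */
};

struct task_model {
	struct page *reclaimed_pages; /* pages this task reclaimed itself */
};

static struct page *zone_take_page(struct zone_model *z)
{
	struct page *p = z->free_list;

	if (p) {
		z->free_list = p->next;
		z->nr_free--;
		p->next = NULL;
	}
	return p;
}

/* Reclaim without capture: the page lands in the shared pool. */
static void reclaim_shared(struct zone_model *z, struct page *p)
{
	assert(p != NULL);
	p->next = z->free_list;
	z->free_list = p;
	z->nr_free++;
}

/* Reclaim with capture: the page is parked on the reclaimer's own list. */
static void reclaim_capture(struct task_model *t, struct page *p)
{
	assert(p != NULL);
	p->next = t->reclaimed_pages;
	t->reclaimed_pages = p;
}

/* Atomic-style allocation: ignores the watermark, takes what is there. */
static struct page *alloc_atomic(struct zone_model *z)
{
	return zone_take_page(z);
}

/* Normal allocation: captured pages first, then the watermark check. */
static struct page *alloc_normal(struct zone_model *z, struct task_model *t)
{
	struct page *p = t->reclaimed_pages;

	if (p) {
		t->reclaimed_pages = p->next;
		p->next = NULL;
		return p;
	}
	if (z->nr_free <= z->min_wmark)
		return NULL; /* would have to reclaim again */
	return zone_take_page(z);
}
```

In this model, a page freed with reclaim_shared() can be taken by
alloc_atomic() before the reclaimer's alloc_normal() runs, so the
reclaimer gets nothing for its work.  With reclaim_capture() the atomic
allocation fails instead and the reclaimer makes progress.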

Somewhat related to what I described in the changelog: we lost the "page 
allocation stalls" artifacts in the kernel log as of 4.15.  The commit
description references an asynchronous mechanism for getting this 
information; I don't know where this mechanism currently lives.