Re: [RFC PATCH 0/2] mm: fix OOMs for binding workloads to movable zone only node

Michal Hocko <mhocko@xxxxxxxx> · Wed, 4 Nov 2020 09:53:43 +0100

On Wed 04-11-20 16:40:21, Feng Tang wrote:
> On Wed, Nov 04, 2020 at 08:58:19AM +0100, Michal Hocko wrote:
> > On Wed 04-11-20 15:38:26, Feng Tang wrote:
> > [...]
> > > > Could you be more specific about the usecase here? Why do you need a
> > > > binding to a pure movable node? 
> > > 
> > > One common configuration for a platform is small size of DRAM plus huge
> > > size of PMEM (which is slower but cheaper), and my guess of their use
> > > is to try to lead the bulk of user space allocation (GFP_HIGHUSER_MOVABLE)
> > > to PMEM node, and only let DRAM be used as less as possible. 
> > 
> > While this is possible, it is a tricky configuration. It is essentially 
> > get us back to 32b and highmem...
> 
> :) Another possible case is similar binding on a memory hotplugable
> platform, which has one unplugable node and several other nodes configured
> as movable only to be hot removable when needed

Yes, another way to shoot your foot ;)

> > As I've said in reply to your second patch. I think we can make the oom
> > killer behavior more sensible in this misconfigured cases but I do not
> > think we want break the cpuset isolation for such a configuration.
> 
> Do you mean we skip the killing and just let the allocation fail? We've
> checked the oom killer code first, when the oom happens, both DRAM
> node and unmovable node have lots of free memory, and killing process
> won't improve the situation.

We already do skip oom killer and fail for lowmem allocation requests already.
This is similar in some sense. Another option would be to kill the
allocating context which will have less corner cases potentially because
some allocation failures might be unexpected.

> (Folloing is copied from your comments for 2/2) 
> > This allows to spill memory allocations over to any other node which
> > has Normal (or other lower) zones and as such it breaks cpuset isolation.
> > As I've pointed out in the reply to your cover letter it seems that
> > this is more of a misconfiguration than a bug.
> 
> For the usage case (docker container running), the spilling is already
> happening, I traced its memory allocation requests, many of them are
> movable, and got fallback to the normal node naturally with current

Could you be more specific? This sounds like a bug. Allocations
shouldn't spill over to a node which is not in the cpuset. There are few
exceptions like IRQ context but that shouldn't happen regurarly.

> code, only a few got blocked, as many of __alloc_pages_nodemask are
> called witih 'NULL' nodemask parameter.
> 
> And I made this RFC patch inspired by code in __alloc_pages_may_oom():
> 
> 	if (gfp_mask & __GFP_NOFAIL)
> 		page = __alloc_pages_cpuset_fallback(gfp_mask, order,
> 				ALLOC_NO_WATERMARKS, ac);

I am not really sure I follow here. __GFP_NOFAIL is a special beast
because such an allocation must not fail. Breaking node affinity is the
only option left. This shouldn't be something used for regular
allocation requests.
-- 
Michal Hocko
SUSE Labs