On Tue, 8 Jun 2010, Andrew Morton wrote:

> > Tasks that do not share the same set of allowed nodes with the task that
> > triggered the oom should not be considered as candidates for oom kill.
> >
> > Tasks in other cpusets with a disjoint set of mems would be unfairly
> > penalized otherwise because of oom conditions elsewhere; an extreme
> > example could unfairly kill all other applications on the system if a
> > single task in a user's cpuset sets itself to OOM_DISABLE and then uses
> > more memory than allowed.
> >
> > Killing tasks outside of current's cpuset rarely would free memory for
> > current anyway. To use a sane heuristic, we must ensure that killing a
> > task would likely free memory for current and avoid needlessly killing
> > others at all costs just because their potential memory freeing is
> > unknown. It is better to kill current than another task needlessly.
>
> This is all a bit arbitrary, isn't it? The key word here is "rarely".

"rarely" certainly is an arbitrary term in this case because it depends
heavily on the memory usage of other cpusets on the system. Consider a
cpuset with 16G of memory and a single task which consumes most of that
memory. Then consider a cpuset with a single 1G node and a task that ooms
within it; the 16G task in the other cpuset gets killed.

There must either be complete exclusion or complete inclusion of a task as
a candidate, since the scale of memory usage among our cpusets cannot be
properly attributed with a single heuristic (divide by 4, divide by 8,
etc). To me, it never seems appropriate to penalize another cpuset's tasks
because of the small chance that they may have allocated atomic memory
elsewhere or that their nodes have recently been changed.

The goal is to make oom killing decisions more predictable without
negatively impacting other cpusets, and this is a step in that direction.

> If indeed this task had allocated gobs of memory from `current's nodes
> and then sneakily switched nodes, this will be a big regression!

It could be, but that's the fault of userspace for assigning an
almost-full node to a new cpuset and expecting it to be completely free.
In other words, we can arrange our cpusets with mems however we want, but
we need some guarantee that giving a cpuset completely free memory and
then killing a task within it because another cpuset went oom doesn't
happen.

> So.. It's not completely clear to me how we justify this decision.
> Are we erring too far on the side of keep-tasks-running? Is failing to
> clear the oom a lot bigger problem than killing an innocent task? I
> think so. In which case we should err towards slaughtering the
> innocent?

The one thing we know is that if the victim's mems_allowed is truly
disjoint from current's, there is no guarantee we'll free memory at all.
Any memory we do free would be the result of GFP_ATOMIC allocations,
which are allowed anywhere, or of memory previously allocated on one of
current's mems.
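To make the disjointness argument concrete, here is a minimal userspace
sketch of the policy being argued for; it is illustrative only, not the
actual mm/oom_kill.c code, and the struct, field, and function names are
invented for the example, with node masks modeled as plain bitmasks
rather than the kernel's nodemask_t:

```c
/*
 * Illustrative model: a task is an oom-kill candidate only if its
 * allowed-node mask intersects the triggering task's mask, i.e. only
 * if killing it could plausibly free memory current is allowed to use.
 */
#include <stdbool.h>
#include <stdio.h>

struct task {
	const char *comm;           /* task name */
	unsigned long mems_allowed; /* bit i set => node i allowed */
};

static bool oom_candidate(const struct task *victim,
			  const struct task *trigger)
{
	/* disjoint mems_allowed => complete exclusion from candidacy */
	return (victim->mems_allowed & trigger->mems_allowed) != 0;
}

int main(void)
{
	struct task trigger = { "small-cpuset-task", 1UL << 0 }; /* node 0 only   */
	struct task big     = { "16G-cpuset-task",   1UL << 1 }; /* node 1 only   */
	struct task sibling = { "same-cpuset-task",  1UL << 0 }; /* shares node 0 */

	printf("%s candidate: %d\n", big.comm,     oom_candidate(&big, &trigger));
	printf("%s candidate: %d\n", sibling.comm, oom_candidate(&sibling, &trigger));
	return 0;
}
```

In the 16G/1G example above, the 16G task's mask is disjoint from the
triggering task's, so it is excluded outright rather than down-weighted
by some arbitrary factor.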