Re: [patch 16/18] oom: badness heuristic rewrite

David Rientjes <rientjes@xxxxxxxxxx> · Wed, 16 Jun 2010 22:32:09 -0700 (PDT)

On Tue, 8 Jun 2010, Andrew Morton wrote:

> > This is a complete rewrite of the oom killer's badness() heuristic which is
> > used to determine which task to kill in oom conditions.  The goal is to
> > make it as simple and predictable as possible so the results are better
> > understood and we end up killing the task which will lead to the most
> > memory freeing while still respecting the fine-tuning from userspace.
> 
> It's not obvious from this description that the end result is better!

I think it's fairly obvious that predictability is an important part of
any heuristic that will determine whether your task survives or dies.

> Have you any testcases or scenarios which got improved?
> 

Yes, as cited below in the changelog with the KDE example.

> > Instead of basing the heuristic on mm->total_vm for each task, the task's
> > rss and swap space is used instead.  This is a better indication of the
> > amount of memory that will be freeable if the oom killed task is chosen
> > and subsequently exits.
> 
> Again, why should we optimise for the amount of memory which a killing
> will yield (if that's what you mean).  We only need to free enough
> memory to unblock the oom condition then proceed.
> 

That's what the oom killer has always done simply because we want to avoid 
subsequent oom conditions in the near future that will require additional 
tasks to be killed.  It seems far better to kill a large memory-hogging 
task[*] than ten smaller tasks that total the same amount of memory usage.

 [*] And, with this rewrite, "memory-hogging" can be defined for the first
     time from userspace with a tunable, oom_score_adj, that actually has
     units so that within a cpuset, for example, we can bias a task by
     25% of available memory or bias other tasks against it by 25%.  For
     the first time ever, we can say "this task should be able to use 25%
     more memory than other tasks without getting killed first."
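
As a rough illustration of the footnote above, the arithmetic might look
like the sketch below.  This is not the patch's code: the function name,
the 0..1000 scale (thousandths of allowable memory), and the clamping are
assumptions made for the example.

```c
/*
 * Sketch only: score a task by the share of allowable memory it uses,
 * then shift that share by a userspace bias.  All names here are
 * hypothetical; the 0..1000 scale and the clamping are assumptions.
 */
long oom_score_sketch(unsigned long rss_pages, unsigned long swap_pages,
                      unsigned long allowed_pages, int oom_score_adj)
{
    long points;

    if (allowed_pages == 0)
        return 0;

    /* Share of allowable memory in use, in thousandths. */
    points = (long)(((unsigned long long)rss_pages + swap_pages) * 1000ULL
                    / allowed_pages);

    /* A bias of -250 discounts 25% of allowable memory from the score. */
    points += oom_score_adj;

    /* Assumed clamp so the result stays within 0..1000. */
    if (points < 0)
        points = 0;
    if (points > 1000)
        points = 1000;
    return points;
}
```

Under these assumptions a task using 30% of allowable memory with a bias
of -250 scores 50, which is lower than an unbiased task using 10% (score
100), so the biased task is not the first one chosen.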

> The last thing we want to do is to kill a process which has consumed
> 1000 CPU hours, or which is providing some system-critical service or
> whatever.  Amount-of-memory-freeable is a relatively minor criterion.
> 

What would you suggest otherwise?  Cputime?  Then we may never be able to 
fork our bash shell or ssh into our machines.

> >  This helps specifically in cases where KDE or
> > GNOME is chosen for oom kill on desktop systems instead of a memory
> > hogging task.
> 
> It helps how?  Examples and test cases?
> 

Because KDE and GNOME typically have very large mm->total_vm values, while
most of the memory actually resident in RAM is consumed by other tasks,
including memory leakers.  mm->total_vm is agreed to be a very poor
heuristic baseline by just about everyone.
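
To make the difference concrete, here is a small sketch comparing the two
baselines.  The struct, the function names, and any numbers fed to them
are hypothetical and only illustrate the selection rule, not the kernel's
data structures.

```c
#include <stddef.h>

/* Hypothetical per-task summary; all fields are in pages. */
struct task_sketch {
    const char *name;
    unsigned long total_vm;  /* mapped address space */
    unsigned long rss;       /* pages resident in RAM */
    unsigned long swap;      /* swap entries held */
};

/* Old-style baseline: the task with the largest total_vm is chosen. */
size_t pick_by_total_vm(const struct task_sketch *t, size_t n)
{
    size_t victim = 0;

    for (size_t i = 1; i < n; i++)
        if (t[i].total_vm > t[victim].total_vm)
            victim = i;
    return victim;
}

/* Rewritten baseline: the task with the largest rss + swap is chosen. */
size_t pick_by_rss_swap(const struct task_sketch *t, size_t n)
{
    size_t victim = 0;

    for (size_t i = 1; i < n; i++)
        if (t[i].rss + t[i].swap > t[victim].rss + t[victim].swap)
            victim = i;
    return victim;
}
```

With a desktop shell that maps 500000 pages but keeps only 40000
resident, and a leaker that maps 200000 pages with 180000 resident and
10000 in swap, the first rule picks the desktop shell and the second
picks the leaker.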

> > The baseline for the heuristic is a proportion of memory that each task is
> > currently using in memory plus swap compared to the amount of "allowable"
> > memory.
> 
> What does "swap" mean?  swapspace includes swap-backed swapcache,
> un-swap-backed swapcache and non-resident swap.  Which of all these is
> being used here and for what reason?
> 

This is swap cache: the number of swap entries for the task that would
become freeable if the task is killed and could subsequently be used for
the page allocations that triggered the oom killer.  We want to add hints to the
oom killer so that memory which cannot be used on blockable memory 
allocations may be freed so we don't call into the oom killer again in the 
near future.

> > /proc/pid/oom_adj is changed so that its meaning is rescaled into the
> > units used by /proc/pid/oom_score_adj, and vice versa.  Changing one of
> > these per-task tunables will rescale the value of the other to an
> > equivalent meaning.  Although /proc/pid/oom_adj was originally defined as
> > a bitshift on the badness score, it now shares the same linear growth as
> > /proc/pid/oom_score_adj but with different granularity.  This is required
> > so the ABI is not broken with userspace applications and allows oom_adj to
> > be deprecated for future removal.
> 
> It was a mistake to add oom_adj in the first place.  Because it's a
> user-visible knob which is tied to a particular in-kernel
> implementation.  As we're seeing now, the presence of that knob locks
> us into a particular implementation.
> 

Agreed.

> Given that oom_score_adj is just a rescaled version of oom_adj
> (correct?), I guess things haven't got a lot worse on that front as a
> result of these changes.
> 

No, it's not a rescaled version at all; we merely rescale oom_adj to
oom_score_adj units because everyone objected to removing oom_adj without 
deprecation first.  oom_score_adj has units: a proportion of memory 
available to the application, meaning how much of the system, memcg, 
cpuset, or mempolicy it should be biased or favored by.  Please see the 
change to Documentation/filesystems/proc.txt which explains this pretty
elaborately.
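
For illustration only, a linear rescaling between the two tunables could
be sketched as follows.  The ranges used here (-17..15 for oom_adj and
-1000..1000 for oom_score_adj), the SKETCH_ constants, the function names,
and the rounding are assumptions for the example; the proc.txt change is
the authoritative description.

```c
/* Assumed ranges, named with a SKETCH_ prefix to mark them as illustrative. */
#define SKETCH_OOM_DISABLE        (-17)
#define SKETCH_OOM_ADJUST_MAX     15
#define SKETCH_OOM_SCORE_ADJ_MAX  1000

/* Rescale an oom_adj value into oom_score_adj units (linear, truncating). */
int oom_adj_to_score_adj_sketch(int oom_adj)
{
    /* Map the top of one range onto the top of the other. */
    if (oom_adj == SKETCH_OOM_ADJUST_MAX)
        return SKETCH_OOM_SCORE_ADJ_MAX;
    return oom_adj * SKETCH_OOM_SCORE_ADJ_MAX / -SKETCH_OOM_DISABLE;
}

/* Rescale an oom_score_adj value back into the coarser oom_adj units. */
int score_adj_to_oom_adj_sketch(int oom_score_adj)
{
    if (oom_score_adj == SKETCH_OOM_SCORE_ADJ_MAX)
        return SKETCH_OOM_ADJUST_MAX;
    return oom_score_adj * -SKETCH_OOM_DISABLE / SKETCH_OOM_SCORE_ADJ_MAX;
}
```

Because oom_adj is the coarser of the two, a round trip loses precision:
under these assumptions an oom_score_adj of 500 becomes an oom_adj of 8,
which rescales back to 470 rather than 500.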

> General observation regarding the patch description: I'm not seeing a
> lot of reason for merging the patch!  What value does it bring to our
> users?  What problems got solved?
> 

It significantly improves the oom killer's predictability, it protects 
vital system tasks like KDE and GNOME on the desktop, it allows users to 
tune each task with a bias or preference in units they understand to 
affect its score, and it allows that interface to remain constant and 
valid even when those tasks are subsequently attached to a cgroup or bound 
to a mempolicy (or their limits or set of allowed nodes are changed).

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@xxxxxxxxxx  For more info on Linux MM,
see: http://www.linux-mm.org/ .