Re: [PATCH 1/5] mm: Introduce OOM kill timeout.

Michal Hocko <mhocko@xxxxxxx> · Mon, 24 Nov 2014 17:50:32 +0100

On Sun 23-11-14 13:50:07, Tetsuo Handa wrote:
> >From ca8b3ee4bea5bcc6f8ec5e8496a97fd4cab5a440 Mon Sep 17 00:00:00 2001
> From: Tetsuo Handa <penguin-kernel@xxxxxxxxxxxxxxxxxxx>
> Date: Sun, 23 Nov 2014 13:38:53 +0900
> Subject: [PATCH 1/5] mm: Introduce OOM kill timeout.
> 
> Regarding many of Linux kernel versions (from unknown till now), any
> local user can give a certain type of memory pressure which causes
> __alloc_pages_nodemask() to keep trying to reclaim memory for presumably
> forever.

Retrying for ever might be an intention (see GFP_NOFAIL).

> As a consequence, such user can disturb any users' activities
> by keeping the system stalled with 0% or 100% CPU usage.

But the above doesn't make much sense to me. Sure reclaim can cause a
lot of CPU cycles to be burnt but most of direct reclaimers are simply
stuck waiting for something - congestion_wait or others.

> On systems where XFS is used, SysRq-f (forced OOM killer) may become
> unresponsive because kernel worker thread which is supposed to process
> SysRq-f request is blocked by previous request's GFP_WAIT allocation.

How is XFS relevant here? Besides that work queue has a fallback mode -
rescuer thread - which processes work items which cannot be processed by
the worker threads because they cannot be created due to allocation
failures. Using workqueues for sysrq triggered OOM is quite suboptimal
but this should be handled on the sysrq layer.

> The problem described above is one of phenomena which is triggered by
> a vulnerability which exists since (if I didn't miss something)
> Linux 2.0 (18 years ago). However, it is too difficult to backport
> patches which fix the vulnerability.

What is the vulnerability?

> Setting TIF_MEMDIE to SIGKILL'ed and/or PF_EXITING thread disables
> the OOM killer. But the TIF_MEMDIE thread may not be able to terminate
> within reasonable duration for some reason. Therefore, in order to avoid
> keeping the OOM killer disabled forever, this patch introduces 5 seconds
> timeout for TIF_MEMDIE threads which are supposed to terminate shortly.

I really do not like this. The timeout sounds arbitrary random. Besides
how would it solve the problem? We would go after another task which
might be blocked on the very same lock. How long should we go? What
happens when all of them wake up and consume all the memory on the way
out because they have access to the memory reserves now?

Also have you actually seen something like that happening?

We had a kind of similar problem in Memory cgroup controller because the
OOM was handled in the allocation path which might sit on many locks and
had to wait for the victim . So waiting for OOM victim to finish would
simply deadlock if the killed task was stuck on any of the locks held by
memcg OOM killer. But this is not the case anymore (we are processing
memcg OOM from the fault path).

The global OOM killer didn't have this kind of problem because OOM
killer doesn't wait for the victim to finish. If the victim waits for
something else that cannot make any progress because of the short memory
then I would call it a bug and it shouldn't be papered over and rather
fixed properly.

The oom killer code is quite complex and subtle already so I really do
not think that we should be adding ad-hoc heuristics without really good
reasons and when all other options are considered not viable. I do not
see any real life problem stated here and what is worse the changelog is
misleading in several ways. So NAK to this patch.

-- 
Michal Hocko
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@xxxxxxxxx.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@xxxxxxxxx";> email@xxxxxxxxx </a>