Re: [PATCH v3] mm: memcontrol: Don't flood OOM messages with no eligible task.

Tetsuo Handa <penguin-kernel@xxxxxxxxxxxxxxxxxxx> · Fri, 19 Oct 2018 19:35:53 +0900

On 2018/10/19 8:54, Sergey Senozhatsky wrote:
> On (10/18/18 20:58), Tetsuo Handa wrote:
>>>
>>> A knob might do.
>>> As well as /proc/sys/kernel/printk tweaks, probably. One can even add
>>> echo "a b c d" > /proc/sys/kernel/printk to .bashrc and adjust printk
>>> console levels on login and rollback to old values in .bash_logout
>>> May be.
>>
>> That can work for only single login with root user case.
>> Not everyone logs into console as root user.
> 
> Add sudo ;)

That will not work. ;-) As long as the console loglevel setting is
system wide, this approach cannot cope with multiple login sessions.
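
To illustrate why the knob is system wide: /proc/sys/kernel/printk holds one
set of four integers (console_loglevel, default_message_loglevel,
minimum_console_loglevel, default_console_loglevel) shared by every session.
Below is a minimal userspace sketch, assuming that four-field layout.
parse_printk_levels() and console_would_print() are hypothetical helper names
made up for this example, not kernel or libc functions.

```c
#include <stdio.h>

/* Field order of the four integers in /proc/sys/kernel/printk. */
enum {
	CONSOLE_LOGLEVEL,
	DEFAULT_MESSAGE_LOGLEVEL,
	MINIMUM_CONSOLE_LOGLEVEL,
	DEFAULT_CONSOLE_LOGLEVEL,
	NR_PRINTK_FIELDS
};

/*
 * Hypothetical helper: parse the "a b c d" text of the knob.
 * Returns 0 on success, -1 if fewer than four integers were found.
 */
static int parse_printk_levels(const char *text, int levels[NR_PRINTK_FIELDS])
{
	int n = sscanf(text, "%d %d %d %d",
		       &levels[CONSOLE_LOGLEVEL],
		       &levels[DEFAULT_MESSAGE_LOGLEVEL],
		       &levels[MINIMUM_CONSOLE_LOGLEVEL],
		       &levels[DEFAULT_CONSOLE_LOGLEVEL]);

	return n == NR_PRINTK_FIELDS ? 0 : -1;
}

/*
 * Hypothetical helper: a message is shown on the console only when its
 * level is numerically lower than console_loglevel. There is one such
 * value for the whole system, not one per login session.
 */
static int console_would_print(const int levels[NR_PRINTK_FIELDS], int msg_level)
{
	return msg_level < levels[CONSOLE_LOGLEVEL];
}
```

With the text "4 4 1 7", a KERN_ERR (3) message would reach the console and a
KERN_WARNING (4) message would not, whichever session last wrote the knob.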

> 
>> It is pity that we can't send kernel messages to only selected consoles
>> (e.g. all messages are sent to netconsole, but only critical messages are
>> sent to local consoles).
> 
> OK, that's a fair point. There was a patch from FB, which would allow us
> to set a log_level on per-console basis. So the noise goes to heav^W net
> console; only critical stuff goes to the serial console (if I recall it
> correctly). I'm not sure what happened to that patch, it was a while ago.
> I'll try to find that out.

A per-console loglevel setting would help in several environments.
But the syzbot environment cannot count on netconsole. We can't expect
that unlimited printk() will become safe.

> 
> [..]
>> That boils down to a "user interaction" problem.
>> Not limiting
>>
>>   "%s invoked oom-killer: gfp_mask=%#x(%pGg), nodemask=%*pbl, order=%d, oom_score_adj=%hd\n"
>>   "Out of memory and no killable processes...\n"
>>
>> is very annoying.
>>
>> And I really can't understand why Michal thinks "handling this requirement" as
>> "make the code more complex than necessary and squash different things together".
> 
> Michal is trying very hard to address the problem in a reasonable way.

OK. But Michal, do we have a reasonable way which can be applied now instead of
my patch or one of the patches below? Just enumerating words like "hackish" or
"a mess" without YOU ACTUALLY PROPOSING PATCHES will bounce back to YOU.

> The problem you are talking about is not MM specific. You can have a
> faulty SCSI device, corrupted FS, and so and on.

"a faulty SCSI device, corrupted FS, and so and on" report problems
where the request will still complete. They can use (and are using)
ratelimit, can't they?

"a memcg OOM with no eligible task" reports a problem where the request
cannot complete. But it can use ratelimit as well.
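
For reference, the kind of interval/burst ratelimit meant here can be sketched
in userspace as follows. This is only a simplified model in the spirit of the
kernel's ratelimit helpers, not the kernel implementation. struct rl_state,
rl_allow() and rl_report() are hypothetical names, and time is passed in
explicitly as an abstract tick count so that the behavior is deterministic.

```c
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical userspace model; these names are not kernel API. */
struct rl_state {
	long interval;	/* window length, in abstract ticks */
	int burst;	/* messages allowed per window */
	long begin;	/* tick at which the current window started */
	int printed;	/* messages emitted in the current window */
	int missed;	/* messages suppressed in the current window */
	bool started;	/* false until the first call opens a window */
};

/* Returns true if the caller may print a message at tick `now`. */
static bool rl_allow(struct rl_state *rs, long now)
{
	/* Open a fresh window on first use or once the interval elapsed. */
	if (!rs->started || now - rs->begin >= rs->interval) {
		rs->begin = now;
		rs->printed = 0;
		rs->missed = 0;
		rs->started = true;
	}

	if (rs->printed < rs->burst) {
		rs->printed++;
		return true;
	}

	/* Over budget for this window: count the message and drop it. */
	rs->missed++;
	return false;
}

/* Print msg if allowed. Returns 1 if printed, 0 if suppressed. */
static int rl_report(struct rl_state *rs, long now, const char *msg)
{
	if (!rl_allow(rs, now))
		return 0;
	printf("[%ld] %s\n", now, msg);
	return 1;
}
```

With interval = 10 and burst = 2, a caller that reports the same OOM condition
on every tick gets two lines per ten ticks. The rest are only counted in
`missed`.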

But we have an immediately applicable mitigation for the problem that
already OOM-killed threads are triggering "a memcg OOM with no eligible
task": apply one of the patches below.



From 0a533d15949eac25f5ce7ce6e53f5830608f08e7 Mon Sep 17 00:00:00 2001
From: Tetsuo Handa <penguin-kernel@xxxxxxxxxxxxxxxxxxx>
Date: Fri, 19 Oct 2018 15:52:56 +0900
Subject: [PATCH v2] mm, oom: OOM victims do not need to select next OOM victim unless __GFP_NOFAIL.

Since commit 696453e66630ad45 ("mm, oom: task_will_free_mem should skip
oom_reaped tasks") changed the OOM killer to select the next OOM victim
as soon as MMF_OOM_SKIP is set, a memcg OOM event from a user process
can generate OOM-killer messages 220+ times (12400+ lines / 730+ KB)
with "Out of memory and no killable processes..." (i.e. no progress)
due to a race window.

This patch completely eliminates this race window by making
out_of_memory() a no-op when called by OOM victims, because OOM victims
do not retry forever (unless __GFP_NOFAIL).

Signed-off-by: Tetsuo Handa <penguin-kernel@xxxxxxxxxxxxxxxxxxx>
---
 mm/oom_kill.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index f10aa53..0e8d20b 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -1058,6 +1058,9 @@ bool out_of_memory(struct oom_control *oc)
 	if (oom_killer_disabled)
 		return false;
 
+	if (tsk_is_oom_victim(current) && !(oc->gfp_mask & __GFP_NOFAIL))
+		return true;
+
 	if (!is_memcg_oom(oc)) {
 		blocking_notifier_call_chain(&oom_notify_list, 0, &freed);
 		if (freed > 0)
-- 
1.8.3.1



From 4a0e9c9514e1c9c5f90f6247a2c142f622558129 Mon Sep 17 00:00:00 2001
From: Tetsuo Handa <penguin-kernel@xxxxxxxxxxxxxxxxxxx>
Date: Fri, 19 Oct 2018 16:31:48 +0900
Subject: [PATCH] mm, oom: task_will_free_mem() should ignore MMF_OOM_SKIP.

Since commit 696453e66630ad45 ("mm, oom: task_will_free_mem should skip
oom_reaped tasks") changed the OOM killer to select the next OOM victim
as soon as MMF_OOM_SKIP is set, a memcg OOM event from a user process
can generate OOM-killer messages 220+ times (12400+ lines / 730+ KB)
with "Out of memory and no killable processes..." (i.e. no progress)
due to a race window.

But since we added fatal_signal_pending() checks to loops which could
result in the behavior observed in the commit above
(e.g. commit 5abf186a30a89d5b ("mm, fs: check for fatal signals in
do_generic_file_read()")), we won't observe such behavior any more.

This patch completely eliminates this race window by removing the
MMF_OOM_SKIP test from task_will_free_mem(), at the risk of falling
into an infinite loop when we have to select the next OOM victim due to
__GFP_NOFAIL allocation requests.

Signed-off-by: Tetsuo Handa <penguin-kernel@xxxxxxxxxxxxxxxxxxx>
---
 mm/oom_kill.c | 7 -------
 1 file changed, 7 deletions(-)

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index f10aa53..981237c 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -800,13 +800,6 @@ static bool task_will_free_mem(struct task_struct *task)
 	if (!__task_will_free_mem(task))
 		return false;
 
-	/*
-	 * This task has already been drained by the oom reaper so there are
-	 * only small chances it will free some more
-	 */
-	if (test_bit(MMF_OOM_SKIP, &mm->flags))
-		return false;
-
 	if (atomic_read(&mm->mm_users) <= 1)
 		return true;
 
-- 
1.8.3.1