[RFC PATCH] mm, oom: oom ratelimit auto tuning

Yafang Shao <laoar.shao@xxxxxxxxx> · Sat, 11 Apr 2020 05:36:14 -0400

We recently hit an issue where the server becomes almost unresponsive
for several minutes when an OOM happens. The cause is a slow serial
console configured with "console=ttyS1,19200". Because this serial line
is so slow, it takes almost 10 seconds to print a full OOM message to
it. Meanwhile, all tasks allocating pages are blocked, because almost no
pages can be reclaimed. During that time the memory pressure stays
around 90 for a long time. If we don't print the OOM messages to this
serial console, a full OOM message takes less than 1ms and the memory
pressure stays below 40.
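
As a rough sanity check on the 10 second figure, the helper below
estimates how long a serial line needs to transmit a message. It is only
a sketch: the 8N1 framing (10 bits on the wire per byte) and the name
serial_tx_ms() are assumptions, and the size of a full OOM message is
not stated in this report.

```c
/* Assumption: 8N1 framing, i.e. 10 bits on the wire per payload byte. */
#define BITS_PER_BYTE_ON_WIRE 10UL

/*
 * Hypothetical helper: milliseconds needed to push nbytes through a
 * serial line running at the given baud rate.
 */
unsigned long serial_tx_ms(unsigned long nbytes, unsigned long baud)
{
	return nbytes * BITS_PER_BYTE_ON_WIRE * 1000UL / baud;
}
```

Under these assumptions a 19200 baud line carries about 1920 bytes per
second, so a dump that takes 10 seconds corresponds to roughly 19 KB of
text.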

We can fix this by adjusting /proc/sys/kernel/printk so that OOM
messages are not printed to the slow serial console, but then no message
at KERN_WARNING level can be printed to it either, and we may lose
useful messages when we want to collect messages from the console for
debugging purposes.

So it is better to decrease the ratelimit. We could introduce sysctl
knobs similar to printk_ratelimit and printk_ratelimit_burst, but that
would burden the admin. Letting the kernel adjust the ratelimit
automatically is a better choice.

The OOM ratelimit starts at a low rate. It increases slowly while the
console is fast and drops rapidly when the console is slow. oom_rs.burst
stays in [1, 10] and the fast path never lowers oom_rs.interval below
5 * HZ.
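
The adjustment step can be modelled in userspace as follows. This is a
sketch for illustration only, not kernel code: struct oom_rs_model and
both function names are made up here, and HZ is assumed to be 1000
(the real value depends on the kernel configuration).

```c
/* Assumption: 1000 jiffies per second; the kernel value is configurable. */
#define HZ 1000

/* Hypothetical stand-in for the two ratelimit_state fields being tuned. */
struct oom_rs_model {
	int interval;	/* length of one ratelimit window, in jiffies */
	int burst;	/* dumps allowed per window */
};

/* Mirrors the initial state DEFINE_RATELIMIT_STATE(oom_rs, 20 * HZ, 1). */
struct oom_rs_model oom_rs_model_init(void)
{
	struct oom_rs_model rs = { 20 * HZ, 1 };

	return rs;
}

/* delta: jiffies spent printing the dump that was just emitted. */
void oom_rs_model_tune(struct oom_rs_model *rs, int delta)
{
	if (delta < rs->interval / 10) {
		/* Fast console: shorten the window, allow one more dump. */
		if (rs->interval >= 10 * HZ)
			rs->interval /= 2;
		else if (rs->interval > 6 * HZ)
			rs->interval -= HZ;

		if (rs->burst < 10)
			rs->burst += 1;
	} else if (rs->burst > 1) {
		/* Slow console: back off to one dump per 4 * delta. */
		rs->burst = 1;
		rs->interval = 4 * delta;
	}
}
```

Starting from the initial state, three fast dumps (delta = 0) move the
model to (10 * HZ, 2), (5 * HZ, 3) and (5 * HZ, 4). A following dump
that takes 10 * HZ resets it to (40 * HZ, 1).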

Signed-off-by: Yafang Shao <laoar.shao@xxxxxxxxx>
---
 mm/oom_kill.c | 51 ++++++++++++++++++++++++++++++++++++++++++++++++---
 1 file changed, 48 insertions(+), 3 deletions(-)

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index dfc357614e56..23dba8ccf313 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -954,8 +954,10 @@ static void oom_kill_process(struct oom_control *oc, const char *message)
 {
 	struct task_struct *victim = oc->chosen;
 	struct mem_cgroup *oom_group;
-	static DEFINE_RATELIMIT_STATE(oom_rs, DEFAULT_RATELIMIT_INTERVAL,
-					      DEFAULT_RATELIMIT_BURST);
+	static DEFINE_RATELIMIT_STATE(oom_rs, 20 * HZ, 1);
+	int delta;
+	unsigned long start;
+	unsigned long end;
 
 	/*
 	 * If the task is already exiting, don't alarm the sysadmin or kill
@@ -972,8 +974,51 @@ static void oom_kill_process(struct oom_control *oc, const char *message)
 	}
 	task_unlock(victim);
 
-	if (__ratelimit(&oom_rs))
+	if (__ratelimit(&oom_rs)) {
+		start = jiffies;
 		dump_header(oc, victim);
+		end = jiffies;
+		delta = end - start;
+
+		/*
+		 * The OOM messages may be printed to a very slow serial
+		 * console, e.g. console=ttyS1,19200. It takes a long
+		 * time to print these OOM messages to such a console, and
+		 * meanwhile all processes allocating pages are blocked
+		 * because hardly any pages can be reclaimed. That causes
+		 * high memory pressure and the system may be unresponsive
+		 * for a long time.
+		 * In this case, we should either decrease the OOM ratelimit
+		 * or avoid printing OOM messages to the slow console. But
+		 * if we avoid printing OOM messages to the slow console,
+		 * no message at KERN_WARNING level can be printed to it
+		 * either, and we may lose useful messages when we want to
+		 * collect messages from the console for debugging
+		 * purposes. So it is better to decrease the ratelimit. We
+		 * could introduce sysctl knobs similar to
+		 * printk_ratelimit and printk_ratelimit_burst, but that
+		 * would burden the admin. Letting the kernel adjust the
+		 * ratelimit automatically is a better choice.
+		 * The algorithm below decreases the OOM ratelimit
+		 * rapidly if the console is slow and increases it slowly
+		 * if the console is fast. oom_rs.burst stays in [1, 10]
+		 * and the fast path never lowers oom_rs.interval below
+		 * 5 * HZ.
+		 */
+		if (delta < oom_rs.interval / 10) {
+			if (oom_rs.interval >= 10 * HZ)
+				oom_rs.interval /= 2;
+			else if (oom_rs.interval > 6 * HZ)
+				oom_rs.interval -= HZ;
+
+			if (oom_rs.burst < 10)
+				oom_rs.burst += 1;
+		} else if (oom_rs.burst > 1) {
+			oom_rs.burst = 1;
+			oom_rs.interval = 4 * delta;
+		}
+
+	}
 
 	/*
 	 * Do we need to kill the entire memory cgroup?
-- 
2.18.2