On Tue, Apr 14, 2020 at 3:39 PM Michal Hocko <mhocko@xxxxxxxxxx> wrote:
>
> On Sat 11-04-20 05:36:14, Yafang Shao wrote:
> > Recently we find an issue that when OOM happens the server is almost
> > unresponsive for several minutes. That is caused by a slow serial set
> > with "console=ttyS1,19200". As the speed of this serial is too slow, it
> > will take almost 10 seconds to print a full OOM message into it. And
> > then all tasks allocating pages will be blocked as there is almost no
> > pages can be reclaimed. At that time, the memory pressure is around 90
> > for a long time. If we don't print the OOM messages into this serial,
> > a full OOM message only takes less than 1ms and the memory pressure is
> > less than 40.
>
> Which part of the oom report takes the most time? I would expect this to
> be the dump_tasks part which can be pretty large when there is a lot of
> eligible tasks to kill.
>

Yes, dump_tasks takes around 6s of the total 10s, show_mem takes around
2s, and dump_stack takes around 0.8s.

> > We can avoid printing OOM messages into slow serial by adjusting
> > /proc/sys/kernel/printk to fix this issue, but then all messages with
> > KERN_WARNING level can't be printed into it neither, that may loss some
> > useful messages when we want to collect messages from the it for
> > debugging purpose.
>
> A large part of the oom report is printed with KERN_INFO log level. So
> you can reduce a large part of the output while not losing other
> potentially important information.
>

Reducing the KERN_INFO output can save a lot of time, but I am worried
that users sometimes need the full log, and they may complain if they
can't find it.

> > So it is better to decrease the ratelimit. We can introduce some sysctl
> > knobes similar with printk_ratelimit and burst, but it will burden the
> > amdin. Let the kernel automatically adjust the ratelimit, that would be
> > a better choice.
>
> No new knobs for ratelimiting.
> Admin shouldn't really care about these
> things.

Agreed.

[snip]

> Besides that I strongly suspect that you would be much better of
> by disabling /proc/sys/vm/oom_dump_tasks which would reduce the amount
> of output a lot. Or do you really require this information when
> debugging oom reports?
>

Yes, disabling /proc/sys/vm/oom_dump_tasks can save a lot of time.
But I'm not sure whether we can disable it completely, because disabling
it would also prevent the task list from being written to
/var/log/messages.

> > The OOM ratelimit starts with a slow rate, and it will increase slowly
> > if the speed of the console is rapid and decrease rapidly if the speed
> > of the console is slow. oom_rs.burst will be in [1, 10] and
> > oom_rs.interval will always greater than 5 * HZ.
>
> I am not against increasing the ratelimit timeout. But this patch seems
> to be trying to be too clever. Why cannot we simply increase the
> parameters of the ratelimit?

I was just worried that users may complain if too many oom_kill_process
reports are suppressed. But OOMs that burst at the same time are always
caused by the same thing, so I think one snapshot of the OOM may be
enough. Simply setting oom_rs to {20 * HZ, 1} can resolve this issue.

> I am also interested whether this actually
> works. AFAIR ratelimit doesn't really work reliably when the ratelimited
> operation takes a long time because the internals have no way to see
> when the operation finished.
>

Agreed, ratelimit() is not reliable in that case.
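To illustrate both points above (why the limiter is unreliable when the
limited operation is slow, and why {20 * HZ, 1} helps), here is a minimal
userspace sketch. It is a simplified model of an interval/burst limiter,
not the kernel's ___ratelimit(); the struct, the function names, HZ = 100
and the 10 s report cost are all assumptions made for illustration.

```c
#include <assert.h>
#include <stdbool.h>

/* Simplified model of an interval/burst rate limiter (illustrative only). */
struct rl_state {
	unsigned long interval;	/* window length, in ticks */
	int burst;		/* calls allowed per window */
	int printed;		/* calls already allowed in this window */
	unsigned long begin;	/* tick at which the current window opened */
	bool started;		/* false until the first call */
};

/* Returns true if the caller may proceed at tick 'now'. */
static bool rl_allow(struct rl_state *rs, unsigned long now)
{
	if (!rs->started) {
		rs->begin = now;
		rs->started = true;
	}
	/*
	 * The window is measured from when the previous allowed call
	 * started, not from when it finished: the limiter cannot see
	 * how long the limited operation took.
	 */
	if (now - rs->begin > rs->interval) {
		rs->begin = now;
		rs->printed = 0;
	}
	if (rs->printed < rs->burst) {
		rs->printed++;
		return true;
	}
	return false;
}

/*
 * Simulate back-to-back OOM events for 'horizon' ticks. An allowed
 * report costs 'report_cost' ticks (slow console); a suppressed one
 * costs 1 tick. Returns the number of ticks spent printing.
 */
static unsigned long ticks_spent_printing(unsigned long interval, int burst,
					  unsigned long report_cost,
					  unsigned long horizon)
{
	struct rl_state rs = { .interval = interval, .burst = burst };
	unsigned long now = 0, spent = 0;

	assert(report_cost > 0);
	while (now < horizon) {
		if (rl_allow(&rs, now)) {
			now += report_cost;
			spent += report_cost;
		} else {
			now += 1;
		}
	}
	return spent;
}
```

With the assumed HZ = 100 and a 10 s report, ticks_spent_printing(5 * 100,
10, 1000, 6000) returns 6000: every report outlasts the 5 s window, so
nothing is ever suppressed and the whole minute is spent printing.
ticks_spent_printing(20 * 100, 1, 1000, 6000) returns 3000, so in this
model {20 * HZ, 1} leaves half of the time free of console output.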
> >  mm/oom_kill.c | 51 ++++++++++++++++++++++++++++++++++++++++++++++++---
> >  1 file changed, 48 insertions(+), 3 deletions(-)
> >
> > diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> > index dfc357614e56..23dba8ccf313 100644
> > --- a/mm/oom_kill.c
> > +++ b/mm/oom_kill.c
> > @@ -954,8 +954,10 @@ static void oom_kill_process(struct oom_control *oc, const char *message)
> >  {
> >  	struct task_struct *victim = oc->chosen;
> >  	struct mem_cgroup *oom_group;
> > -	static DEFINE_RATELIMIT_STATE(oom_rs, DEFAULT_RATELIMIT_INTERVAL,
> > -					      DEFAULT_RATELIMIT_BURST);
> > +	static DEFINE_RATELIMIT_STATE(oom_rs, 20 * HZ, 1);
> > +	int delta;
> > +	unsigned long start;
> > +	unsigned long end;
> >
> >  	/*
> >  	 * If the task is already exiting, don't alarm the sysadmin or kill
> > @@ -972,8 +974,51 @@ static void oom_kill_process(struct oom_control *oc, const char *message)
> >  	}
> >  	task_unlock(victim);
> >
> > -	if (__ratelimit(&oom_rs))
> > +	if (__ratelimit(&oom_rs)) {
> > +		start = jiffies;
> >  		dump_header(oc, victim);
> > +		end = jiffies;
> > +		delta = end - start;
> > +
> > +		/*
> > +		 * The OOM messages may be printed to a serial with very low
> > +		 * speed, e.g. console=ttyS1,19200. It will take long
> > +		 * time to print these OOM messages to this serial, and
> > +		 * then processes allocating pages will all be blocked due
> > +		 * to it can hardly reclaim pages. That will case high
> > +		 * memory pressure and the system may be unresponsive for a
> > +		 * long time.
> > +		 * In this case, we should decrease the OOM ratelimit or
> > +		 * avoid printing OOM messages into the slow serial. But if
> > +		 * we avoid printing OOM messages into the slow serial, all
> > +		 * messages with KERN_WARNING level can't be printed into
> > +		 * it neither, that may loss some useful messages when we
> > +		 * want to collect messages from the console for debugging
> > +		 * purpose. So it is better to decrease the ratelimit.
> > +		 * We can introduce some sysctl knobes similar with
> > +		 * printk_ratelimit and burst, but it will burden the
> > +		 * admin. Let the kernel automatically adjust the ratelimit
> > +		 * would be a better chioce.
> > +		 * In bellow algorithm, it will decrease the OOM ratelimit
> > +		 * rapidly if the console is slow and increase the OOM
> > +		 * ratelimit slowly if the console is fast. oom_rs.burst
> > +		 * will be in [1, 10] and oom_rs.interval will always
> > +		 * greater than 5 * HZ.
> > +		 */
> > +		if (delta < oom_rs.interval / 10) {
> > +			if (oom_rs.interval >= 10 * HZ)
> > +				oom_rs.interval /= 2;
> > +			else if (oom_rs.interval > 6 * HZ)
> > +				oom_rs.interval -= HZ;
> > +
> > +			if (oom_rs.burst < 10)
> > +				oom_rs.burst += 1;
> > +		} else if (oom_rs.burst > 1) {
> > +			oom_rs.burst = 1;
> > +			oom_rs.interval = 4 * delta;
> > +		}
> > +
> > +	}
> >
> >  	/*
> >  	 * Do we need to kill the entire memory cgroup?
> > --
> > 2.18.2
> >

--
Thanks
Yafang