Re: [PATCH] printk: Ratelimit messages printed by console drivers

Sergey Senozhatsky <sergey.senozhatsky.work@xxxxxxxxx> · Wed, 25 Apr 2018 14:31:46 +0900

On (04/23/18 14:45), Petr Mladek wrote:
[..]
> I am not sure how slow are the slowest consoles. If I take that
> everything should be faster than 1200 bauds. Then 10 minutes
> should be enough for 1000 lines and 80 characters per-line:

Well, the problem with the numbers is that they are too... simple...
let me put it this way.

What if I don't have a slow serial console? Or what if I have NMI
watchdog set to 40 seconds? Or what if I don't have NMIs at all?
Why am I all of a sudden limited by "1200 bauds"?
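
To put rough numbers on that, here is a user-space sketch (not kernel
code) of the drain-time arithmetic behind the "1200 bauds" estimate.
The helper name drain_seconds() and the 10 bits per character on the
wire (8N1 framing) are my assumptions, nothing from the patch:

```c
/*
 * Hypothetical helper: seconds needed to push `lines` lines of `cols`
 * characters through a serial console running at `baud` bits per second.
 */
double drain_seconds(unsigned long lines, unsigned long cols,
		     unsigned long baud)
{
	/* assumption: start bit + 8 data bits + stop bit per character */
	const double bits_per_char = 10.0;

	return (double)lines * (double)cols * bits_per_char / (double)baud;
}
```

Under these assumptions drain_seconds(1000, 80, 1200) is about 667
seconds, while drain_seconds(1000, 80, 115200) is about 7 seconds. The
same 1000 lines cost wildly different amounts of time depending on the
console, so a single hard-coded interval fits almost nobody.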

Another problem is that we limit the *wrong* thing.

Not only because we can [and probably need to] rate-limit the misbehaving
code that calls printk(), rather than printk() itself, but also because we
claim that we limit the "number of lines" added recursively. This is wrong.
We limit the number of times vprintk_func() was called, which is != the
number of added lines. Because vprintk_func() is also called for pr_cont()
or printk(KERN_CONT) or printk("missing new line"). Backtraces contain
tons and tons of pr_cont()-s - registers print out, list of modules
print out, stack print out, code print out. Even this thing at the
bottom of a trace:

	Code: 01 ca 49 89 d1 48 89 d1 48 c1 ea 23 48 8b 14 d5 80 23 63 82 49 c1 e9 0c 48 c1 e9 1b 48 85 d2 74 0a 0f b6 c9 48 c1 e1 04 48 01 ca <48> 8b 12 49 c1 e1 06 b9 00 00 00 80 89 7d 80 89 75 84 48 8b 3d

is nothing but a bunch of pr_cont()-s, each of which will individually
end up in vprintk_func(). Error reports in general can contain even more
pr_cont() calls. E.g. core kernel code can hex dump slab memory, while
being called from one of the console drivers.
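
A user-space sketch of the calls-vs-lines mismatch, assuming one call
per opcode byte as described above. fake_vprintk(), dump_code_line()
and struct printk_stats are made-up names for illustration; this is
not the real vprintk_func() and it only does the accounting:

```c
#include <stddef.h>
#include <string.h>

struct printk_stats {
	unsigned int calls;	/* times the entry point was invoked */
	unsigned int lines;	/* fragments that ended with '\n' */
};

/* Hypothetical stand-in for vprintk_func(): counts, prints nothing. */
void fake_vprintk(struct printk_stats *st, const char *fragment)
{
	size_t len = strlen(fragment);

	st->calls++;
	if (len > 0 && fragment[len - 1] == '\n')
		st->lines++;
}

/* Emit a "Code:" line the pr_cont() way: one call per opcode byte. */
struct printk_stats dump_code_line(unsigned int nbytes)
{
	struct printk_stats st = { 0, 0 };
	unsigned int i;

	fake_vprintk(&st, "Code:");		/* printk("Code:") */
	for (i = 0; i < nbytes; i++)
		fake_vprintk(&st, " 00");	/* pr_cont(" %02x", byte) */
	fake_vprintk(&st, "\n");		/* pr_cont("\n") */

	return st;
}
```

The Code: line quoted above has 64 bytes, and dump_code_line(64) counts
66 calls for exactly 1 line. A limit that counts calls would charge that
single line 66 times.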

Another problem is that nothing tells us that we *actually* have an
infinite loop. Nothing tells us that every call_console_drivers()
adds more messages to the logbuf. We see just one thing - the current
call_console_drivers() is about to add some lines to the logbuf later
on. OK, why is this a problem? This can be a one-time thing. Or
console_unlock() may be in a schedulable context, getting rescheduled
after every line it prints [either implicitly after
printk_safe_exit_irqrestore(), or explicitly by calling into the
scheduler - cond_resched()].

Most likely, we don't even realize how many things we are about to
break.

> Alternatively, it seems that we are going to call console drivers
> outside printk_safe context => the messages will appear in the main
> log buffer immediately => only small risk of a ping-pong with printk
> safe buffers. We might reset the counter when all messages are handled
> in console_unlock(). It will be more complex patch than when using
> ratelimiting but it still should be sane.

We may have some sort of vprintk_func()-based solution, maybe.
But we first need a real reason. Right now it looks to me like
we have "a solution" to a problem which we have never witnessed.

That vprintk_func()-based solution, if there are no other
options on the table, must be much smarter than anything that
we have seen so far.

Sorry.

	-ss