On Mon 2018-04-23 14:21:33, Sergey Senozhatsky wrote: > On (04/20/18 11:12), Petr Mladek wrote: > call_console_drivers() is super complex, unbelievable complex. In fact, > it's so complex that we never know where we will end up, because it can > pass the control to almost every core kernel mechanism or subsystem: I fully agree. We will never be able to completely avoid messages from these code paths and it would even be contra-productive. People need to see the problems and be able to debug them. BTW: I wrote this in the patch description. > A very quick googling: Nice collection. Just note that the useful (ratelimited) information always fits into the first 100 lines in all these examples: > https://bugzilla.altlinux.org/attachment.cgi?id=5811 The two WARNINGs starts repeating after 65 lines. Note that the backtraces at the end of the log are from NMI watchdog => not ratelimited by this patch (with the extra fix) > https://access.redhat.com/solutions/702533 fits into 52 lines > https://bugzilla.redhat.com/attachment.cgi?id=561164 both warnings fit into 54 lines > https://lists.gt.net/linux/kernel/2341113 BUG report fits into 26 lines > https://www.systutorials.com/linux-kernels/56987/ib-mlx4-reduce-sriov-multicast-cleanup-warning-message-to-debug-level-linux-4-10-17/ NMI bactrace => not affected by this patch (with the extra fix) > https://github.com/raspberrypi/linux/issues/663 Bug report fits into 29 lines > https://bugs.openvz.org/browse/VZWEB-36 Starts to repeat after 79 lines. I wonder if it ever ended without killing the system. > Throttling down that error mechanism to 100 lines > per hour, or 1000 lines per hour is unlikely will be welcomed. I wonder if you have bigger problems with the number of lines or with the length of the period. We simply _must_ limit the number of lines. Otherwise we would never be able to break an _infinite_ loop. But we could eventually replace the time period with a more complex solution. For example, if we call console drivers outside printk_safe context then the "recursive" messages will be written to the mail log buffer directly. Then we could reset the counter of the recursive messages when leaving console_unlock() the regular way. I mean when all lines are handled. > Among all the patches and proposal that we saw so far, one stands out - it's > the original Tejun's patch [offloading to work queue]. Because it has zero > interference with the existing call_console_drivers()->printk() > channels. The only problem is that it does not solve the infinite loop. If writing one line produces one or more new lines (warnings/errors) than the only way out is to start dropping the recursive messages. Offloading would just move the infinite loop to another process. Note that the offload might help if there is a deadlock/livelock between the original printk() caller and console drivers. Then moving console_unlock() to another "safe" context helps to unblock the situation. But the offload has its own problems and limiting the number of recursive messages would solve this as well. > What is so special about this case that we decided to screw up printk() > instead? There are many situations where printk() is limited. For example, we are limited to 8kB in NMI or printk_safe() context. The printk_safe() context is about printk debugging. Also messages from console drivers are about printk debugging. There must be some limitations by definition. > I think that we need to apply the patch below. > That call_console_drivers()->printk->IRQ_work->irq->flush appears to be > pointless. I agree that calling console drivers in printk_safe context is pointless. Normal vprintk_emit() can be safely called because logbuf_lock is not taken here. Also console_unlock() will never by called recursively because it is guarded by console_sem. > Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@xxxxxxxxx> > --- > kernel/printk/printk.c | 2 ++ > 1 file changed, 2 insertions(+) > > diff --git a/kernel/printk/printk.c b/kernel/printk/printk.c > index 247808333ba4..484c456c095a 100644 > --- a/kernel/printk/printk.c > +++ b/kernel/printk/printk.c > @@ -2385,9 +2385,11 @@ void console_unlock(void) > */ > console_lock_spinning_enable(); > > + __printk_safe_exit(); > stop_critical_timings(); /* don't trace print latency */ > call_console_drivers(ext_text, ext_len, text, len); > start_critical_timings(); > + __printk_safe_enter(); Is this by intention? What is the reason to call console_lock_spinning_disable_and_check() in printk_safe() context, please? > if (console_lock_spinning_disable_and_check()) { > printk_safe_exit_irqrestore(flags); Best Regards, Petr