>>>>> "John" == John Ogness <john.ogness@xxxxxxxxxxxxx> writes:

John> On 2019-03-06, Steven Rostedt <rostedt@xxxxxxxxxxx> wrote:
>>> This bug only happens if we select large logbuffer (millions of
>>> characters). With smaller log buffer, there are messages "** X printk
>>> messages dropped", but there's no lockup.
>>>
>>> The kernel apparently puts 2 million characters into a console log
>>> buffer, then takes some lock and than tries to write all of them to a
>>> slow serial line.
>>>
>>> [...]
>>>
>>> The MD-RAID is supposed to recalculate data for the corrupted device
>>> and bring it back to life. However, scrubbing the MD-RAID device
>>> resulted in a lot of reads from the device with bad checksums, these
>>> were reported to the log and killed the machine.
>>>
>>> I made a patch to dm-integrity to rate-limit the error messages. But
>>> anyway - killing the machine in case of too many log messages seems
>>> bad. If the log messages are produced faster than the kernel can
>>> write them, the kernel should discard some of them, not kill itself.
>>
>> Sounds like another aurgment for the new printk design.

John> Assuming the bad checksum messages are considered an emergency
John> (for example, at least loglevel KERN_WARN), then the new printk
John> design would print those messages synchronously to the slow
John> serial line in the context of the driver as the driver is
John> producing them.

John> There wouldn't be a lock-up, but it would definitely slow down
John> the driver. The situation of "messages being produced faster
John> than the kernel can write them" would never exist because the
John> printk() call will only return after the writing is completed. I
John> am curious if that would be acceptable here?

The real problem is the disconnect between the speed and capacity (in
bits/sec) of the serial console and that of the regular console.
Serial, especially at 9600 baud, is just a slow and limited resource
that needs to be handled differently from a graphical console.
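
To put a rough number on that gap, here is a back-of-the-envelope
sketch. It assumes 8N1 framing (10 bits on the wire per character),
which the report does not state, and it uses the 2 million character
buffer quoted above. The function name is made up for illustration.

```c
#include <assert.h>

/*
 * Seconds needed to push `chars` characters through a serial line.
 * bits_per_char is an assumption: 10 for 8N1 framing
 * (1 start bit + 8 data bits + 1 stop bit).
 */
static double drain_seconds(double chars, double baud, double bits_per_char)
{
    assert(baud > 0.0);
    return chars * bits_per_char / baud;
}
```

With those assumptions, drain_seconds(2e6, 9600.0, 10.0) comes out at
roughly 2083 seconds, so about 35 minutes of holding up whoever is
doing the writing.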

I'm also big on rate limiting messages, even critical warning
messages. Too much redundant info doesn't help anyone. And what a
subsystem thinks is critical may not be critical to the system as a
whole.

In this case, if these checksum messages are telling us that there's
corruption, why isn't dm-integrity going read-only, and making the
block device get the filesystem to go read-only as well, to stop the
damage right away?

If it's just a nice-to-have warning, then please rate limit the
messages, or summarize them in some more useful way. Or even log them
somewhere other than the console once the problem is noted.

John
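
P.S. The "rate limit or summarize" idea above could look roughly like
this. It is a minimal userspace sketch, not kernel code and not the
dm-integrity patch: the struct and function names are hypothetical,
and time is passed in as plain seconds to keep it self-contained. In
the kernel itself the existing printk_ratelimited() helper would be
the natural starting point.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical limiter: at most `burst` messages per `interval` seconds. */
struct msg_limiter {
    long interval;      /* window length in seconds */
    int  burst;         /* messages allowed per window */
    long window_start;  /* time the current window opened */
    int  printed;       /* messages emitted in this window */
    int  suppressed;    /* messages dropped in this window */
};

/*
 * Returns true if the caller may emit a message at time `now`.
 * When a new window opens, one summary line reports how many
 * messages the previous window dropped.
 */
static bool msg_limiter_allow(struct msg_limiter *ml, long now)
{
    assert(ml != NULL && ml->interval > 0 && ml->burst > 0);

    if (now - ml->window_start >= ml->interval) {
        if (ml->suppressed > 0)
            fprintf(stderr, "%d similar messages suppressed\n",
                    ml->suppressed);
        ml->window_start = now;
        ml->printed = 0;
        ml->suppressed = 0;
    }

    if (ml->printed < ml->burst) {
        ml->printed++;
        return true;
    }

    ml->suppressed++;
    return false;
}
```

A caller would wrap each checksum warning in
"if (msg_limiter_allow(&ml, now))", so a flood of identical errors
costs the console a handful of lines plus one summary per window.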