On 2019-03-06, Steven Rostedt <rostedt@xxxxxxxxxxx> wrote:
>> This bug only happens if we select large logbuffer (millions of
>> characters). With smaller log buffer, there are messages "** X printk
>> messages dropped", but there's no lockup.
>>
>> The kernel apparently puts 2 million characters into a console log
>> buffer, then takes some lock and then tries to write all of them to a
>> slow serial line.
>>
>> [...]
>>
>> The MD-RAID is supposed to recalculate data for the corrupted device
>> and bring it back to life. However, scrubbing the MD-RAID device
>> resulted in a lot of reads from the device with bad checksums, these
>> were reported to the log and killed the machine.
>>
>> I made a patch to dm-integrity to rate-limit the error messages. But
>> anyway - killing the machine in case of too many log messages seems
>> bad. If the log messages are produced faster than the kernel can
>> write them, the kernel should discard some of them, not kill itself.
>
> Sounds like another argument for the new printk design.

Assuming the bad checksum messages are considered an emergency (for
example, at least loglevel KERN_WARNING), then the new printk design
would print those messages synchronously to the slow serial line in the
context of the driver as the driver is producing them. There wouldn't be
a lock-up, but it would definitely slow down the driver.

The situation of "messages being produced faster than the kernel can
write them" would never exist because the printk() call would only
return after the writing is completed. I am curious whether that would
be acceptable here.

John Ogness
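
To make the trade-off concrete, here is a rough userspace sketch. It is not kernel code and not the actual printk implementation. All names (sim_result, sim_buffered, sim_synchronous) and the abstract "tick" time unit are made up for illustration, and the buffered model only covers the "messages dropped" case, not the lock-up seen with a very large buffer. It assumes console_cost is at least 1.

```c
#include <stddef.h>

/* Hypothetical model: time is counted in abstract ticks, not real units. */
struct sim_result {
	unsigned long producer_ticks; /* ticks the caller spent emitting messages */
	unsigned long dropped;        /* messages discarded before reaching the console */
};

/*
 * Buffered model (assumption): the caller spends 1 tick per message to
 * store it in a buffer holding `capacity` messages. The console drains
 * one message every `console_cost` ticks (console_cost must be >= 1).
 * A message that arrives while the buffer is full is dropped.
 */
struct sim_result sim_buffered(unsigned long n_msgs, unsigned long capacity,
			       unsigned long console_cost)
{
	struct sim_result r = { 0, 0 };
	unsigned long queued = 0;
	unsigned long tick;

	for (tick = 1; tick <= n_msgs; tick++) {
		/* The slow console completes one message every console_cost ticks. */
		if (queued > 0 && tick % console_cost == 0)
			queued--;

		if (queued < capacity)
			queued++;       /* message stored for later output */
		else
			r.dropped++;    /* buffer full: message is lost */

		r.producer_ticks++;     /* storing costs the caller 1 tick */
	}
	return r;
}

/*
 * Synchronous model (assumption): every message is written to the
 * console in the caller's context, so the caller pays console_cost
 * ticks per message and nothing is ever dropped.
 */
struct sim_result sim_synchronous(unsigned long n_msgs,
				  unsigned long console_cost)
{
	struct sim_result r = { 0, 0 };

	r.producer_ticks = n_msgs * console_cost;
	return r;
}
```

With 100 messages and a console that needs 5 ticks per message, sim_synchronous(100, 5) charges the caller 500 ticks and drops nothing, while sim_buffered(100, 10, 5) keeps the caller at 100 ticks but loses most of the messages. Which of the two is preferable for the dm-integrity case is exactly the open question.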