Re: Serial console is causing system lock-up

John Ogness <john.ogness@xxxxxxxxxxxxx> · Wed, 06 Mar 2019 23:43:45 +0100

On 2019-03-06, Steven Rostedt <rostedt@xxxxxxxxxxx> wrote:
>> This bug only happens if we select a large log buffer (millions of
>> characters). With a smaller log buffer, there are messages "** X printk
>> messages dropped", but there's no lockup.
>> 
>> The kernel apparently puts 2 million characters into a console log
>> buffer, then takes some lock and then tries to write all of them to a
>> slow serial line.
>>
>> [...]
>>
>> The MD-RAID is supposed to recalculate data for the corrupted device
>> and bring it back to life. However, scrubbing the MD-RAID device
>> resulted in a lot of reads from the device with bad checksums, these
>> were reported to the log and killed the machine.
>> 
>> I made a patch to dm-integrity to rate-limit the error messages. But
>> anyway - killing the machine in case of too many log messages seems
>> bad.  If the log messages are produced faster than the kernel can
>> write them, the kernel should discard some of them, not kill itself.
>
> Sounds like another argument for the new printk design.

Assuming the bad checksum messages are considered an emergency (for
example, loglevel KERN_WARNING or more severe), then the new printk
design would print those messages synchronously to the slow serial line
in the context of the driver as the driver is producing them.

There wouldn't be a lock-up, but it would definitely slow down the
driver. The situation of "messages being produced faster than the kernel
can write them" would never exist because the printk() call will only
return after the writing is completed. I am curious if that would be
acceptable here?
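
To put rough numbers on that trade-off, here is a back-of-the-envelope
sketch. Everything except the 2 million character backlog is an
assumption on my part and not taken from the report: a 115200 baud line,
8N1 framing (10 bits on the wire per character), and checksum error
messages of about 100 characters. The helper names are made up for
illustration only.

```python
# Back-of-the-envelope model; the line speed, framing and message length
# are assumptions, not measurements from the affected machine.

BITS_PER_CHAR = 10  # assumed 8N1 framing: 1 start + 8 data + 1 stop bit


def write_seconds(num_chars: int, baud: int) -> float:
    """Time to push num_chars over a serial line running at baud bits/s."""
    return num_chars * BITS_PER_CHAR / baud


def max_messages_per_second(msg_chars: int, baud: int) -> float:
    """Upper bound on the message rate if every printk() call blocks
    until its message has been written to the line."""
    return 1.0 / write_seconds(msg_chars, baud)


if __name__ == "__main__":
    baud = 115200        # assumed line speed
    backlog = 2_000_000  # characters, as in the report
    msg_len = 100        # assumed length of one checksum error message

    print(f"drain backlog:   {write_seconds(backlog, baud):.1f} s")
    print(f"one message:     {write_seconds(msg_len, baud) * 1000:.2f} ms")
    print(f"max synchronous: {max_messages_per_second(msg_len, baud):.0f} msgs/s")
```

Under those assumptions, draining the full backlog in one go takes close
to three minutes, whereas synchronous printing would charge the driver
several milliseconds per message and so cap it at roughly a hundred
messages per second.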

John Ogness