Re: Serial console is causing system lock-up

Mikulas Patocka <mpatocka@xxxxxxxxxx> · Wed, 6 Mar 2019 12:11:10 -0500 (EST)

On Wed, 6 Mar 2019, Theodore Y. Ts'o wrote:

> On Wed, Mar 06, 2019 at 11:07:55AM -0500, Mikulas Patocka wrote:
> > This bug only happens if we select large logbuffer (millions of 
> > characters). With smaller log buffer, there are messages "** X printk 
> > messages dropped", but there's no lockup.
> > 
> > The kernel apparently puts 2 million characters into a console log buffer, 
> > then takes some lock and then tries to write all of them to a slow serial 
> > line.
> 
> What are the messages; from what kernel subsystem?  Why are you seeing
> so many log messages?
> 
> 					- Ted

The dm-integrity subsystem (drivers/md/dm-integrity.c) can be attached to a 
block device to provide checksum protection. It will return -EILSEQ and 
print a message to the log for every corrupted block.

Nigel Croxon was testing MD-RAID recovery capabilities by activating a 
RAID-5 array with one leg replaced by a dm-integrity block device that 
had all checksums invalid.

MD-RAID is supposed to recalculate the data for the corrupted device and 
bring it back to life. However, scrubbing the MD-RAID device resulted in a 
lot of reads from the device with bad checksums. Each failed read was 
reported to the log, and the resulting flood of messages killed the machine.

I made a patch to dm-integrity to rate-limit the error messages. Even so, 
killing the machine when there are too many log messages seems bad. 
If log messages are produced faster than the kernel can write them, 
the kernel should discard some of them, not kill itself.
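
To illustrate the idea of rate-limiting, here is a minimal user-space sketch 
in C. It is not the dm-integrity patch and it is not kernel code: the struct, 
the function names, the message text and the interval/burst values are all 
hypothetical. It only shows the general technique of allowing a small burst 
of messages per time window and counting the rest as suppressed, similar in 
spirit to what the kernel's printk_ratelimited() does.

```c
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical limiter: at most `burst` messages per `interval` time units. */
struct ratelimit {
	unsigned long interval;	/* window length, in caller-defined units */
	unsigned int burst;	/* messages allowed per window */
	unsigned long begin;	/* start time of the current window */
	unsigned int printed;	/* messages emitted in the current window */
	unsigned int missed;	/* messages suppressed in the current window */
};

static void ratelimit_init(struct ratelimit *rl, unsigned long interval,
			   unsigned int burst)
{
	rl->interval = interval;
	rl->burst = burst;
	rl->begin = 0;
	rl->printed = 0;
	rl->missed = 0;
}

/* Returns true if the caller may emit one message at time `now`. */
static bool ratelimit_allow(struct ratelimit *rl, unsigned long now)
{
	if (now - rl->begin >= rl->interval) {
		/* Window expired: summarize what was dropped, then reset. */
		if (rl->missed)
			fprintf(stderr, "%u messages suppressed\n", rl->missed);
		rl->begin = now;
		rl->printed = 0;
		rl->missed = 0;
	}
	if (rl->printed < rl->burst) {
		rl->printed++;
		return true;
	}
	rl->missed++;
	return false;
}

/*
 * Simulate `count` checksum failures arriving at time `now`.
 * Returns how many messages were actually emitted.
 */
static unsigned int report_errors(struct ratelimit *rl, unsigned long now,
				  unsigned int count)
{
	unsigned int emitted = 0;
	unsigned int i;

	for (i = 0; i < count; i++) {
		if (ratelimit_allow(rl, now)) {
			printf("checksum failed (simulated error %u)\n", i);
			emitted++;
		}
	}
	return emitted;
}
```

With a limiter initialized as ratelimit_init(&rl, 5, 10), a call to 
report_errors(&rl, 0, 1000) emits only the first 10 messages and counts the 
other 990 as suppressed, so the volume sent to a slow console stays bounded 
no matter how many blocks fail.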

Mikulas