Re: Serial console is causing system lock-up

Petr Mladek <pmladek@xxxxxxxx> · Thu, 7 Mar 2019 16:16:57 +0100

On Wed 2019-03-06 12:11:10, Mikulas Patocka wrote:
> 
> 
> On Wed, 6 Mar 2019, Theodore Y. Ts'o wrote:
> 
> > On Wed, Mar 06, 2019 at 11:07:55AM -0500, Mikulas Patocka wrote:
> > > This bug only happens if we select large logbuffer (millions of 
> > > characters). With smaller log buffer, there are messages "** X printk 
> > > messages dropped", but there's no lockup.
> > > 
> > > The kernel apparently puts 2 million characters into a console log buffer, 
> > > then takes some lock and then tries to write all of them to a slow serial 
> > > line.
> > 
> > What are the messages; from what kernel subsystem?  Why are you seeing
> > so many log messages?
> > 
> > 					- Ted
> 
> The dm-integrity subsystem (drivers/md/dm-integrity.c) can be attached to a 
> block device to provide checksum protection. It will return -EILSEQ and 
> print a message to a log for every corrupted block.
> 
> Nigel Croxon was testing MD-RAID recovery capabilities in such a way that 
> he activated RAID-5 array with one leg replaced by a dm-integrity block 
> device that had all checksums invalid.
> 
> The MD-RAID is supposed to recalculate data for the corrupted device and 
> bring it back to life. However, scrubbing the MD-RAID device resulted in a 
> lot of reads from the device with bad checksums, these were reported to 
> the log and killed the machine.
> 
> 
> I made a patch to dm-integrity to rate-limit the error messages. But 
> anyway - killing the machine in case of too many log messages seems
> bad. If the log messages are produced faster than the kernel can write them, 
> the kernel should discard some of them, not kill itself.

printk() cannot easily detect where the messages come from or
whether it is acceptable to drop them.

In general, an "unlimited" output of messages that are not very useful
looks like a bug on the caller side. Some rate-limiting in the
dm-integrity code looks appropriate here.

Even better might be to stop printing the messages after X occurrences
until the check is completed. It might do something like:

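int errors_count = 0;

while (find_error()) {
	errors_count++;
	/* Report only the first 10 errors in detail. */
	if (errors_count <= 10)
		pr_err("...");
	/* Announce once that further errors are not printed. */
	else if (errors_count == 11)
		pr_err("Too many errors. Continuing check silently\n");
}

/* Print a single summary when the check is done. */
if (errors_count)
	pr_err("Check finished. %d errors detected\n", errors_count);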

Note that rate-limiting is a bit ugly. It gets no feedback from
printk/consoles about how long it takes to get the messages out. It
might not be enough if the console is very slow or if there are other
printk() users at the same time.
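
To illustrate the missing feedback, here is a minimal userspace sketch
of an interval/burst limiter. It is not the kernel's implementation:
struct rl_state, rl_init(), rl_allow() and drain_ms() are hypothetical
names, and the numbers used with them are picked only for illustration.
The limiter decides purely from the caller's clock (assumed monotonic)
and never learns how long the console needs to drain what it let
through.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical userspace model of an interval/burst rate limit. */
struct rl_state {
	uint64_t interval_ms;	/* length of one window */
	unsigned int burst;	/* messages allowed per window */
	uint64_t window_start;	/* start of the current window, in ms */
	unsigned int printed;	/* messages let through in this window */
	unsigned int missed;	/* messages suppressed in this window */
};

static void rl_init(struct rl_state *rs, uint64_t interval_ms,
		    unsigned int burst)
{
	assert(rs);
	rs->interval_ms = interval_ms;
	rs->burst = burst;
	rs->window_start = 0;
	rs->printed = 0;
	rs->missed = 0;
}

/*
 * Returns true if the caller may print at time now_ms.
 * Note what is absent: there is no input describing the console.
 */
static bool rl_allow(struct rl_state *rs, uint64_t now_ms)
{
	assert(rs);
	/* Open a new window once the interval has elapsed. */
	if (now_ms - rs->window_start >= rs->interval_ms) {
		rs->window_start = now_ms;
		rs->printed = 0;
		rs->missed = 0;
	}
	if (rs->printed < rs->burst) {
		rs->printed++;
		return true;
	}
	rs->missed++;
	return false;
}

/*
 * Time in ms that a console moving chars_per_sec characters per second
 * needs to emit one full burst of msg_len-character messages.
 */
static uint64_t drain_ms(unsigned int burst, unsigned int msg_len,
			 unsigned int chars_per_sec)
{
	return (uint64_t)burst * msg_len * 1000 / chars_per_sec;
}
```

With a 5000 ms window and a burst of 10, one caller printing
100-character messages costs drain_ms(10, 100, 960), roughly one second
per window on a line moving about 960 characters per second. One such
caller fits. Several callers with the same limits would not, and
nothing in rl_allow() can notice that.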

That said, I agree that printk() should not kill the system. It should
survive these "mistakes". On the other hand, printk() users must
cooperate. The log buffer and console bandwidth are limited resources.

Best Regards,
Petr