On Wed, 6 Mar 2019 12:11:10 -0500 (EST)
Mikulas Patocka <mpatocka@xxxxxxxxxx> wrote:

> On Wed, 6 Mar 2019, Theodore Y. Ts'o wrote:
>
> > On Wed, Mar 06, 2019 at 11:07:55AM -0500, Mikulas Patocka wrote:
> > > This bug only happens if we select a large log buffer (millions of
> > > characters). With a smaller log buffer, there are messages "** X printk
> > > messages dropped", but there's no lockup.
> > >
> > > The kernel apparently puts 2 million characters into a console log buffer,
> > > then takes some lock and then tries to write all of them to a slow serial
> > > line.
> >
> > What are the messages; from what kernel subsystem? Why are you seeing
> > so many log messages?
> >
> > - Ted
>
> The dm-integrity subsystem (drivers/md/dm-integrity.c) can be attached to a
> block device to provide checksum protection. It will return -EILSEQ and
> print a message to the log for every corrupted block.
>
> Nigel Croxon was testing MD-RAID recovery capabilities in such a way that
> he activated a RAID-5 array with one leg replaced by a dm-integrity block
> device that had all checksums invalid.
>
> The MD-RAID is supposed to recalculate data for the corrupted device and
> bring it back to life. However, scrubbing the MD-RAID device resulted in a
> lot of reads from the device with bad checksums; these were reported to
> the log and killed the machine.
>
> I made a patch to dm-integrity to rate-limit the error messages. But
> anyway - killing the machine in case of too many log messages seems bad.
> If the log messages are produced faster than the kernel can write them,
> the kernel should discard some of them, not kill itself.

Sounds like another argument for the new printk design.

-- Steve
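
For readers unfamiliar with the idea of rate-limiting error messages, here
is a minimal userspace sketch in C. It is not the dm-integrity patch
mentioned above and not the kernel's own facility (the kernel provides
helpers such as printk_ratelimited()). The struct ratelimit type, the
ratelimit_allow() and simulate_errors() functions, and the interval/burst
values are all hypothetical names chosen for illustration. The sketch allows
at most "burst" messages per "interval" ticks and counts the rest as
dropped, rather than queueing them for a slow console.

```c
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical limiter: at most `burst` messages per `interval` ticks. */
struct ratelimit {
    unsigned long interval;  /* window length, in caller-defined ticks */
    unsigned int burst;      /* messages allowed per window */
    unsigned long begin;     /* tick at which the current window started */
    unsigned int printed;    /* messages let through in this window */
    unsigned int missed;     /* messages dropped in this window */
};

/* Return true if the caller may emit a message at tick `now`. */
static bool ratelimit_allow(struct ratelimit *rl, unsigned long now)
{
    if (now - rl->begin >= rl->interval) {
        /* Window expired: report what was dropped, then start over. */
        if (rl->missed)
            fprintf(stderr, "** %u messages suppressed\n", rl->missed);
        rl->begin = now;
        rl->printed = 0;
        rl->missed = 0;
    }
    if (rl->printed < rl->burst) {
        rl->printed++;
        return true;
    }
    rl->missed++;
    return false;
}

/* Simulate `n` checksum errors arriving at tick `now`.
 * Returns how many of them were actually logged. */
static unsigned int simulate_errors(struct ratelimit *rl,
                                    unsigned long now, unsigned int n)
{
    unsigned int logged = 0;

    for (unsigned int i = 0; i < n; i++) {
        if (ratelimit_allow(rl, now)) {
            fprintf(stderr, "checksum failed on block %u\n", i);
            logged++;
        }
    }
    return logged;
}
```

With a limiter initialised as { .interval = 5, .burst = 10 }, a burst of a
million simulated errors in one tick logs ten lines and counts the rest as
dropped, so the cost of logging stays bounded no matter how many blocks
fail their checksum.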