On 3/25/21 1:28 AM, Laurence Oberman wrote:
On Mon, 2021-03-22 at 17:02 -0400, Laurence Oberman wrote:
Hello
We have been struggling with this for years.
Systems are getting so large now that a system with multi-terabyte
memory and thousands of device paths is becoming common.
For example, customers are seeing 16 paths, and with 1000 LUNs that's
16000 device discoveries, each logged as multiple lines on the console.
We land up in Emergency mode and various incarnations of "can't boot"
due to console output slowdown that (while worse on serial consoles) is
still a huge overhead, and that can even require us to set
watchdog_thresh on the kernel command line to prevent the NMIs.
I started thinking about a new parameter for scsi_mod that could be
used by sd and the scsi_dh_alua probing / discovery messaging (that is
so noisy), to quieten it down.
Before I even put effort into this, I wanted to see if you folks have
an appetite for this.
We have been blacklisting HBA drivers and using various printk masks
etc. to overcome this, but a way to mask this within sd.c and
scsi_dh_alua.c I think could work better.
It would not be the default of course but an option to be added for
these huge customers.
I would look to do the minimal logging for a device discovery, just so
some messaging is there for debug etc. and I think it will help.
If this is a crazy idea, let me know and I won't pursue it, but I
decided to just put it out there.
Best Regards
Laurence Oberman
Replying to my own thread with more information
RFE: Introduce two new macros to manage the crazy amount of boot
logging we get with the large LUN count systems
sd_printk_boot_control
sdev_printk_boot_control
These macros have an extra parameter, boot_log_enable, and if it is left
at its default (1) then logs are printed.
Adding scsi_mod.scsi_alua_boot_logging=0 will quiet down the logging
for these huge systems.
With no parameter (default) nothing changes in the logging
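
To illustrate the gating described above, here is a minimal userspace
sketch. It is not the patch from this thread: fake_sdev_printk(),
discover(), the counter and the message text are hypothetical stand-ins,
and the variable boot_log_enable merely plays the role of the proposed
scsi_mod.scsi_alua_boot_logging parameter.

```c
#include <assert.h>
#include <stdarg.h>
#include <stdio.h>

/* Hypothetical stand-in for the proposed module parameter;
 * 1 (the default) keeps every message, 0 suppresses them. */
static int boot_log_enable = 1;

/* Counts messages actually emitted so the effect can be observed. */
static int lines_printed;

/* Userspace stand-in for sdev_printk(): prefix with the device name. */
static void fake_sdev_printk(const char *dev, const char *fmt, ...)
{
    va_list ap;

    assert(dev != NULL && fmt != NULL);
    printf("sd %s: ", dev);
    va_start(ap, fmt);
    vprintf(fmt, ap);
    va_end(ap);
    lines_printed++;
}

/* Sketch of a boot-controlled wrapper: forward the message only
 * when the extra "enable" argument is non-zero. */
#define sdev_printk_boot_control(enable, dev, ...)      \
    do {                                                \
        if (enable)                                     \
            fake_sdev_printk(dev, __VA_ARGS__);         \
    } while (0)

/* Simulate discovery of nr_luns LUNs over nr_paths paths and
 * return how many lines reached the console. */
static int discover(int nr_luns, int nr_paths)
{
    char name[32];

    lines_printed = 0;
    for (int lun = 0; lun < nr_luns; lun++) {
        for (int path = 0; path < nr_paths; path++) {
            snprintf(name, sizeof(name), "%d:0:0:%d", path, lun);
            /* Placeholder text, not a real kernel log line. */
            sdev_printk_boot_control(boot_log_enable, name,
                                     "example discovery message\n");
        }
    }
    return lines_printed;
}
```

With boot_log_enable left at 1, every call prints, so behaviour is
unchanged; setting it to 0 silences only the call sites that were
converted to the wrapper.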
With boot log control and regular console
134s to boot and 1987 lines with 80 devices and 2 paths
With no boot control (default) and regular console
170s to boot and about 4000 lines of logging
The patch inline is not final, so I did not send it with git given this
is an RFE.
It is included to show the changes I was thinking about.
Well, _actually_ it's not just the SCSI drivers; it's just that the scsi
driver exhibits these issues nicely.
The hope I had was that we can resolve this issue by making printk
asynchronous, such that each call to printk() wouldn't block.
That really should give us most of what we want; the only issue is what
to do with those messages which are spooled (but not printed).
For graphical UI this probably doesn't matter as the user will end up
with a graphical interface sooner or later.
For text console things become tricky; we will need the console to get
our prompt, but it might still be busy printing out stuff.
Can't we have a 'low priority' output of these messages, and stop
printing them to the console once 'getty' starts?
Thing is, once 'getty' is up and running the user _can_ log in, so he
can do any debugging he likes from the system console; there the message
log on the console is less important as the user can get the system log
via other means.
It only gets important once getty is _not_ up, but then it's less time
critical as there's nothing the user _can_ do.
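
The 'low priority' idea above could be sketched roughly as follows. This
is a userspace illustration only, not how printk is implemented: struct
log_ring, log_store(), console_flush() and the console_muted flag are
all hypothetical names, and the flag stands for "getty has started".

```c
#include <assert.h>
#include <stdio.h>

#define LOG_SLOTS   8
#define LOG_MSG_LEN 64

/* Hypothetical spool for messages that are stored but not yet printed. */
struct log_ring {
    char msg[LOG_SLOTS][LOG_MSG_LEN];
    unsigned int head;   /* next slot to write */
    unsigned int tail;   /* next slot to hand to the console */
    int console_muted;   /* set once a login prompt is available */
};

/* Producer side: store the message and return at once, without
 * touching the console. If the ring is full, the oldest pending
 * message is dropped. */
static void log_store(struct log_ring *r, const char *text)
{
    assert(r != NULL && text != NULL);
    if (r->head - r->tail == LOG_SLOTS)
        r->tail++;
    snprintf(r->msg[r->head % LOG_SLOTS], LOG_MSG_LEN, "%s", text);
    r->head++;
}

/* Consumer side: drain pending messages. While the console is not
 * muted they are printed; once muted they are skipped for the console
 * (a real system would still keep them in the system log).
 * Returns the number of messages written to the console. */
static int console_flush(struct log_ring *r)
{
    int printed = 0;

    while (r->tail != r->head) {
        if (!r->console_muted) {
            puts(r->msg[r->tail % LOG_SLOTS]);
            printed++;
        }
        r->tail++;
    }
    return printed;
}
```

The point of the split is that log_store() never waits for a slow
console, while console_flush() decides separately whether spooled
messages are still worth printing.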
Thoughts?
Cheers,
Hannes
--
Dr. Hannes Reinecke Kernel Storage Architect
hare@xxxxxxx +49 911 74053 688
SUSE Software Solutions GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), Geschäftsführer: Felix Imendörffer