Issues with very large LUN count servers and booting becoming more and more of a problem

Laurence Oberman <loberman@xxxxxxxxxx> · Mon, 22 Mar 2021 17:02:54 -0400

Hello
We have been struggling with this for years.
Systems are getting so large now that a system with multi-terabyte
memory and 1000s of device paths is becoming common.

For example, customers are seeing 16 paths per LUN, and with 1000 LUNs
that's 16000 device paths, each producing multi-line discovery messages
in the console log.
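
To put rough numbers on that, here is a back-of-envelope sketch. The
path and LUN counts are the ones above; the lines per path, characters
per line, and serial baud rate are assumed values for illustration
only, not measurements.

```c
#include <stdio.h>

/* Total device paths = LUNs * paths per LUN. */
static unsigned long total_paths(unsigned long luns, unsigned long paths_per_lun)
{
	return luns * paths_per_lun;
}

/*
 * Seconds needed to push the discovery messages through a serial console.
 * Assumes 8N1 framing, i.e. 10 bits on the wire per character.
 * lines_per_path and chars_per_line are guesses supplied by the caller.
 */
static double console_seconds(unsigned long paths, unsigned long lines_per_path,
			      unsigned long chars_per_line, unsigned long baud)
{
	double chars = (double)paths * lines_per_path * chars_per_line;

	return chars * 10.0 / (double)baud;
}
```

total_paths(1000, 16) gives 16000. With an assumed 5 lines of 80
characters per path at 115200 baud, console_seconds(16000, 5, 80,
115200) comes to roughly 556 seconds of console output time.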

We land up in Emergency mode and various incarnations of "can't boot"
due to console output slowdown that (while worse on serial consoles) is
still huge overhead, and that can even require us to use watchdog_thresh
on the kernel command line to prevent the NMIs.

I started thinking about a new parameter for scsi_mod that could be
used by sd and the scsi_dh_alua probing / discovery messaging (that is
so noisy), to quieten it down.

Before I even put effort into this, I wanted to see if you folks have an
appetite for this.

We have been blacklisting HBA drivers and using various printk masks
etc. to overcome this, but I think a way to mask this within sd.c and
scsi_dh_alua.c could work better.
It would not be the default of course but an option to be added for
these huge customers.
I would look to do the minimal logging for a device discovery, just so
some messaging is there for debug etc., and I think it will help.
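
As a minimal userspace sketch of the idea (not a patch): the names
discovery_quiet, discovery_log() and the two levels are hypothetical
and do not exist in scsi_mod, sd.c or scsi_dh_alua.c today. The point
is only that the option is off by default and, when set, still leaves
one minimal line per device.

```c
#include <stdarg.h>
#include <stdio.h>

/* Hypothetical levels; not existing scsi_mod options. */
enum discovery_level {
	DISC_MINIMAL = 0,	/* one line per device, kept even when quiet */
	DISC_VERBOSE = 1,	/* everything else that is printed today */
};

/*
 * Stand-in for the proposed parameter. 0 (the default) keeps today's
 * behaviour; 1 suppresses everything above DISC_MINIMAL. In the kernel
 * this would be exposed with module_param() rather than a plain static.
 */
static int discovery_quiet;

/* Returns 1 if the message was emitted, 0 if it was suppressed. */
static int discovery_log(enum discovery_level level, const char *fmt, ...)
{
	va_list ap;

	if (discovery_quiet && level > DISC_MINIMAL)
		return 0;

	va_start(ap, fmt);
	vprintf(fmt, ap);
	va_end(ap);
	return 1;
}
```

With discovery_quiet left at 0 every call prints as it does now. With
it set to 1, a DISC_VERBOSE call returns without printing, while a
DISC_MINIMAL call still emits its single line for debug.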

If this is a crazy idea, let me know and I won't pursue it, but I
decided to just put it out there.

Best Regards
Laurence Oberman