Re: blinking lights

On 10/25/18 8:51 AM, John Spray wrote:
On Wed, Oct 24, 2018 at 7:30 PM Ric Wheeler <rwheeler@xxxxxxxxxx> wrote:
On 10/24/18 2:08 PM, Lars Marowsky-Bree wrote:
On 2018-10-24T19:03:45, John Spray <jspray@xxxxxxxxxx> wrote:

Actually this should probably be built with extension to the whole
node in mind! In case of a total node failure in a 500-node DC, that
node might need to be blinked just as well.
Do you mean blinking the node when it's running but none of its OSDs
are, or blinking a fully offline node using lights-off management
layers?
Both would be useful in the long run. Lights-off, or even some tool
like LightPath or whatever it was called that actually let you blink
the entire path to a device - in a DC with 20k drives you don't want to
be looking for the one blinking drive; you want the light on the row,
then the rack, and then the drive to show you the path.

I'm not saying this needs to be in from day 0, but as we think about it,
we may want to keep the idea in mind - e.g., somehow relate this to the
topology of the hw we have in the crushmap?



Regards,
      Lars
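
To make the crushmap idea concrete, here is a minimal sketch of how a tool could walk from an OSD to the row/rack/host beacons it should light. It assumes "ceph osd find" emits JSON with a crush_location section whose keys match the buckets in your crushmap, and the OSD id is made up:

    #!/usr/bin/env python3
    # Sketch: map an OSD to its CRUSH location so the whole path
    # (row -> rack -> host -> drive) can be lit, not just the drive.
    # Assumes "ceph osd find" returns JSON with a "crush_location"
    # dict; the exact keys depend on the buckets in your crushmap.
    import json
    import subprocess

    def crush_path(osd_id):
        out = subprocess.check_output(
            ["ceph", "osd", "find", str(osd_id), "--format", "json"])
        return json.loads(out).get("crush_location", {})

    if __name__ == "__main__":
        loc = crush_path(12)                    # hypothetical OSD id
        for level in ("row", "rack", "host"):   # coarse to fine
            if level in loc:
                print("light the beacon for %s %s" % (level, loc[level]))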

Back at Thinking Machines, we had enough lights to blink that we could make a
big rolling arrow pointing at something for you :)

https://www.top500.org/featured/systems/cm-5-los-alamos-national-lab/

In all seriousness, one issue other storage providers (even ones with total
control over the hardware) have had is that the blinking-light code is done by
SES poking per-component firmware, and that has been moderately flaky.

I don't recall who listened to Erik talk at Vault (?) a few years back about
their blinking-light code, but it did occasionally light the LED in the wrong
enclosure due to a firmware bug.

To be clear, overall this is a huge need, so I am hopeful it will be a major
step forward for us, but we need to be very careful about testing and maybe
look at probing the hardware path/blacklisting parts that are known to be flaky.
True for certain definitions of "we" -- when Ceph is consuming
libstoragemgmt, it seems like libstoragemgmt would be the sensible
layer to blacklist any known-buggy hardware, rather than the higher
level storage systems.

John

Typically, we put a blacklist of things known not to work at the lowest common point of the stack (e.g., a driver in Linux will blacklist/whitelist devices), so that makes sense for stacks running with libstoragemgmt.
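
A minimal sketch of what that consumer-side check could look like, assuming the libstoragemgmt Python binding provides lsm.LocalDisk.ident_led_on() and with purely illustrative vendor/model entries:

    #!/usr/bin/env python3
    # Sketch: refuse to poke the identify LED on hardware whose firmware
    # is known to misbehave. Assumes the libstoragemgmt Python binding
    # exposes lsm.LocalDisk.ident_led_on(); the blacklist entries below
    # are made up for illustration.
    import os
    import lsm

    # (vendor, model) pairs known to blink the wrong slot -- hypothetical.
    LED_BLACKLIST = {("ACME", "JBOD-9000"), ("EXAMPLE", "SES-FLAKY-1")}

    def sysfs_id(dev):
        """Read vendor/model strings from sysfs for a /dev/sdX device."""
        base = "/sys/block/%s/device" % os.path.basename(dev)
        def read(name):
            try:
                with open(os.path.join(base, name)) as f:
                    return f.read().strip()
            except OSError:
                return ""
        return read("vendor"), read("model")

    def blink(dev):
        vendor, model = sysfs_id(dev)
        if (vendor, model) in LED_BLACKLIST:
            raise RuntimeError("%s (%s %s) is blacklisted for LED control"
                               % (dev, vendor, model))
        lsm.LocalDisk.ident_led_on(dev)

    if __name__ == "__main__":
        blink("/dev/sdb")   # hypothetical device

The point is only that the lookup happens once, in the layer every caller shares, rather than being duplicated in each storage system on top.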

If other implementations avoid using this, I suppose they can make their own list or copy it out.

My broader hope is that we can get a good test that helps validate the right thing happens, so anyone can shake out bad hardware or configurations. If such a test is used widely, hardware vendors might well start using it to weed out bad hardware before we have to...
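
One possible shape for such a test, again assuming lsm.LocalDisk.list() and the ident_led_on()/ident_led_off() calls from libstoragemgmt's Python binding, is a loop that lights each drive in turn and asks the operator (or a camera rig) to confirm the right slot lit up:

    #!/usr/bin/env python3
    # Sketch of a blink-validation pass: light each disk's identify LED
    # in turn and record whether the correct slot actually lit up.
    # Assumes the libstoragemgmt Python binding offers LocalDisk.list()
    # and ident_led_on()/ident_led_off().
    import time
    import lsm

    def validate_leds(settle=5):
        results = {}
        for dev in lsm.LocalDisk.list():
            lsm.LocalDisk.ident_led_on(dev)
            time.sleep(settle)   # give SES/firmware time to react
            answer = input("Is the LED for %s blinking in the right slot? [y/n] " % dev)
            results[dev] = answer.strip().lower().startswith("y")
            lsm.LocalDisk.ident_led_off(dev)
        return results

    if __name__ == "__main__":
        for dev, ok in validate_leds().items():
            print("%-12s %s" % (dev, "OK" if ok else "WRONG SLOT / NO LIGHT"))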

ric




