On Wed, Oct 24, 2018 at 7:30 PM Ric Wheeler <rwheeler@xxxxxxxxxx> wrote:
>
> On 10/24/18 2:08 PM, Lars Marowsky-Bree wrote:
> > On 2018-10-24T19:03:45, John Spray <jspray@xxxxxxxxxx> wrote:
> >
> >>> Actually this should probably be built with the extension to the whole
> >>> node in mind! In case of a total node failure in a 500-node DC, the
> >>> node might need to be blinked just as well.
> >> Do you mean blinking the node when it's running but none of its OSDs
> >> are, or blinking a fully offline node using lights-out management
> >> layers?
> > Both would be useful in the long run. Lights-out, or even some tool
> > like LightPath (or whatever it was called) that actually let you blink
> > the entire path to a device - in a DC with 20k drives you don't want to
> > be looking for the one blinking drive; you want the light on the row,
> > then the rack, and then the drive to show you the path.
> >
> > I'm not saying this needs to be in from day 0, but as we think about it,
> > we may want to keep the idea in mind - e.g., somehow relate this to the
> > topology of the hardware we have in the crushmap?
> >
> > Regards,
> >     Lars
> >
> Back at Thinking Machines, we had enough lights to blink that we could
> make a big rolling arrow pointing at something for you :)
>
> https://www.top500.org/featured/systems/cm-5-los-alamos-national-lab/
>
> In all seriousness, one issue other storage providers (even ones with
> total control of the hardware) have had is that the blinking-light code
> works via SES, by poking per-hardware-component firmware, and that has
> been moderately flaky.
>
> I don't recall who listened to Erik talk at Vault (?) a few years back
> about their blinking-light code, but it did occasionally light the LED
> in the wrong enclosure due to a firmware bug.
>
> To be clear, overall, this is a huge need, so I am hopeful it will be a
> major step forward for us, but we need to be very careful about testing
> and maybe look at probing the hardware path and blacklisting parts that
> are known to be flaky.

True for certain definitions of "we" -- when Ceph is consuming
libstoragemgmt, it seems like libstoragemgmt would be the sensible layer
to blacklist any known-buggy hardware, rather than the higher-level
storage systems.

John

>
> Ric
>
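
P.S. For context on what driving an identification LED through libstoragemgmt
might look like, here is a minimal sketch using the lsm Python bindings. It
assumes the LocalDisk ident-LED calls are available and supported by the
enclosure in question; this is not Ceph's actual implementation, just an
illustration of where a consumer could detect a flaky hardware path and
blacklist it rather than risk blinking the wrong drive:

    #!/usr/bin/env python3
    # Minimal sketch (not Ceph's actual code): toggle a drive's identification
    # LED through libstoragemgmt's Python bindings.  Assumes the lsm package
    # is installed and that LocalDisk ident-LED control is supported for the
    # device in question.
    import sys

    from lsm import LocalDisk, LsmError


    def blink(disk_path, on=True):
        """Turn the identification LED on or off for a local disk, e.g. /dev/sdb."""
        try:
            if on:
                LocalDisk.ident_led_on(disk_path)
            else:
                LocalDisk.ident_led_off(disk_path)
        except LsmError as exc:
            # Flaky SES/firmware paths surface as errors here; a consumer could
            # record the failure and blacklist this enclosure/slot instead of
            # trusting the LED it may (or may not) have lit.
            print("LED control failed for %s: %s" % (disk_path, exc), file=sys.stderr)
            return False
        return True


    if __name__ == "__main__":
        path = sys.argv[1] if len(sys.argv) > 1 else "/dev/sdb"
        action = sys.argv[2] if len(sys.argv) > 2 else "on"
        sys.exit(0 if blink(path, on=(action == "on")) else 1)

A higher layer could also shell out to the lsmcli equivalent (roughly
"lsmcli local-disk-ident-led-on --path /dev/sdb") rather than use the
bindings directly; either way the blacklist decision can live in one place.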