On 10/24/18 2:08 PM, Lars Marowsky-Bree wrote:
On 2018-10-24T19:03:45, John Spray <jspray@xxxxxxxxxx> wrote:
Actually this should probably be built with the extension to the whole
node in mind! In case of a total node failure in a 500-node DC, the node
might need to be blinked just as well.
Do you mean blinking the node when it's running but none of its OSDs
are, or blinking a fully offline node using lights-off management
layers?
Both would be useful in the long run. Lights-off, or even some tools
like LightPath or whatever it was called that actually allowed you to blink
the entire path to a device - in a DC with 20k drives you don't want to
be looking for the one blinking one, you want the light on the row,
then the rack, and then the drive to show you the path.
I'm not saying this needs to be in from day 0, but as we think about it,
we may want to keep the idea in mind - e.g., somehow relate this to the
topology of the hw we have in the crushmap?
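
As a rough illustration of relating the lights to the topology, here is a
minimal sketch that walks a CRUSH-like hierarchy from a device up to the root
and returns the chain of locations to light. The parent map, the bucket names
and the light_path() helper are all hypothetical; nothing here is an existing
Ceph API.

```python
# Hypothetical parent map for a CRUSH-like tree: child -> parent bucket.
# The names are made up for illustration only.
PARENT = {
    "osd.12": "host-a",
    "host-a": "rack-3",
    "rack-3": "row-1",
    "row-1": "root",
}


def light_path(device, parent=PARENT):
    """Return the locations to light, from the outermost bucket to the device."""
    path = [device]
    seen = {device}
    # Climb until we reach a node that has no recorded parent.
    while path[-1] in parent:
        nxt = parent[path[-1]]
        if nxt in seen:
            # Guard against a malformed map that loops back on itself.
            raise ValueError(f"cycle in topology at {nxt!r}")
        seen.add(nxt)
        path.append(nxt)
    path.reverse()
    return path


if __name__ == "__main__":
    # Prints the buckets in the order a person would walk them.
    print(" -> ".join(light_path("osd.12")))
```

Each entry in the returned list is a place where a light could be turned on,
so the person walking the floor follows row, then rack, then host, then drive.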
Regards,
Lars
Back at Thinking Machines, we had enough lights to blink that we could make a
big rolling arrow pointing at something for you :)
https://www.top500.org/featured/systems/cm-5-los-alamos-national-lab/
In all seriousness, one issue other storage providers (even ones with total
control over the hardware) have had is that the blinking light code is done by SES
by poking per-hardware-component firmware, and that has been moderately flaky.
I don't recall who listened to Erik talk at Vault (?) a few years back about
their blinking light code, but it did occasionally light the LED in the wrong
enclosure due to a firmware bug.
To be clear, overall, this is a huge need so I am hopeful it will be a major
step forward for us, but we need to be very careful about testing and maybe
look at probing the hardware path/blacklisting parts that are known to be flaky.
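
One way the blacklisting idea could look, as a minimal sketch: keep a list of
enclosure vendor/model/firmware combinations known to mis-light LEDs and
refuse to blink through them. The Enclosure type, the FLAKY entries and the
can_blink() helper are hypothetical placeholders, not real hardware data or
an existing interface.

```python
from typing import NamedTuple


class Enclosure(NamedTuple):
    """Identity of an enclosure as reported by some probe (hypothetical)."""
    vendor: str
    model: str
    firmware: str


# Hypothetical blacklist of combinations known to light the wrong LED.
FLAKY = {
    ("ExampleVendor", "JBOD-X", "1.0.3"),
}


def can_blink(enc, flaky=FLAKY):
    """Return True only if the enclosure is not on the blacklist."""
    return (enc.vendor, enc.model, enc.firmware) not in flaky


if __name__ == "__main__":
    bad = Enclosure("ExampleVendor", "JBOD-X", "1.0.3")
    good = Enclosure("ExampleVendor", "JBOD-X", "1.0.4")
    for enc in (bad, good):
        action = "blink" if can_blink(enc) else "skip (blacklisted)"
        print(enc.firmware, action)
```

Matching on the exact firmware version keeps a fixed firmware release from
staying blacklisted, at the cost of having to maintain the list.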
Ric