Re: blinking lights

Ric Wheeler <rwheeler@xxxxxxxxxx> · Tue, 23 Oct 2018 19:55:34 -0400

On 10/23/18 7:08 PM, Sage Weil wrote:
I gave the latest lsmcli (libstoragemgmt) another try and it can blink the
HDD lights on my generic 2u supermicro boxes!  It was a bit of a hassle
because ubuntu has an ancient version packaged, but once I built from
source it can do 'ident' (blinky red light) or 'fault' (solid red light).
Pretty simple!  And now is the time to harass the ubuntu/debian folks to
get this into the next round of releases so we can take advantage of it.
(Fedora/RHEL/CentOS should already have a good version.)

With the new device tracking that's coming in nautilus, I think we have
most of the pieces to surface useful ceph controls to turn lights on and
off.  For example,

$ ceph device ls
DEVICE                                  HOST:DEV      DAEMONS LIFE EXPECTANCY
Crucial_CT1024M550SSD1_14160C164100     stud:sdd      osd.40  >6w
Crucial_CT1024M550SSD1_14210C25B79E     eutow:sds     osd.19  >6w

So we could add

$ ceph device ident-on Crucial_CT1024M550SSD1_14160C164100
$ ceph device fault-on Crucial_CT1024M550SSD1_14210C25B79E
...
$ ceph device ident-off Crucial_CT1024M550SSD1_14160C164100
$ ceph device fault-off Crucial_CT1024M550SSD1_14210C25B79E

or perhaps

$ ceph osd ident-on osd.123
$ ceph osd fault-on osd.124

(although note that OSDs may be backed by multiple devices, and you
probably don't want to pull the shared db/journal device in most cases).
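
To make the multiple-devices caveat concrete, here is a rough Python sketch
that maps a daemon name to its devices by parsing the 'ceph device ls' table
shown above.  The helper names are hypothetical, and it assumes the DAEMONS
column would hold a comma-separated list when a device backs more than one
daemon, which the sample output does not confirm.

```python
def parse_device_ls(text):
    """Parse the plain-text table printed by 'ceph device ls'."""
    devices = []
    for line in text.strip().splitlines()[1:]:  # skip the header row
        fields = line.split()
        if len(fields) < 3:
            continue
        host, _, dev = fields[1].partition(":")
        devices.append({
            "device": fields[0],
            "host": host,
            "dev": dev,
            # Assumption: several daemons would be comma-separated.
            "daemons": fields[2].split(","),
        })
    return devices


def devices_for_daemon(devices, daemon):
    """Return the ids of every device that backs the given daemon."""
    return [d["device"] for d in devices if daemon in d["daemons"]]


SAMPLE = """\
DEVICE                                  HOST:DEV      DAEMONS LIFE EXPECTANCY
Crucial_CT1024M550SSD1_14160C164100     stud:sdd      osd.40  >6w
Crucial_CT1024M550SSD1_14210C25B79E     eutow:sds     osd.19  >6w
"""

if __name__ == "__main__":
    print(devices_for_daemon(parse_device_ls(SAMPLE), "osd.40"))
```

An 'osd ident-on' style command would then have to decide which of the
returned devices to light, e.g. skipping a shared db/journal device.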

My current thinking is that Ceph persistently stores which lights should
be on, and raises a HEALTH_WARN (or HEALTH_INFO, nudge nudge) alert so
that the operator knows that the light(s) are (still) on.

How to run lsmcli
-----------------

We can pretty trivially invoke 'lsmcli local-disk-fault-led-off --path
whatever' (or do something more minimal using the python bindings).  The
gotcha is that we have to have something running on that host in order to
do it.
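
As a minimal sketch of the "just invoke lsmcli" approach, assuming it runs
on the host that owns the disk: the subcommand 'local-disk-fault-led-off'
is the one quoted above, and the other three names are assumed to follow
the same pattern in the installed lsmcli.  led_command() and set_led() are
hypothetical helpers, not part of Ceph or libstoragemgmt.

```python
import subprocess

# 'local-disk-fault-led-off' comes from the text above; the remaining
# subcommand names are assumed to follow the same naming pattern.
LED_ACTIONS = {
    ("ident", True): "local-disk-ident-led-on",
    ("ident", False): "local-disk-ident-led-off",
    ("fault", True): "local-disk-fault-led-on",
    ("fault", False): "local-disk-fault-led-off",
}


def led_command(kind, on, path):
    """Build the lsmcli argv for one LED change on a local disk."""
    return ["lsmcli", LED_ACTIONS[(kind, on)], "--path", path]


def set_led(kind, on, path, run=subprocess.run):
    """Run lsmcli locally; return True only if it exited with status 0."""
    result = run(led_command(kind, on, path), capture_output=True, text=True)
    return result.returncode == 0


if __name__ == "__main__":
    # Prints the command instead of running it, so this is safe anywhere.
    print(" ".join(led_command("fault", False, "/dev/sdd")))
```

The 'run' parameter is only there so the call can be stubbed out in tests;
whatever ends up running this still has to be on the right host.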

So, it would be pretty easy for an osd to ident its device(s) when it is
up, but if it's not up, then... not so much.

A few options:

1) Only do the ident/fault from a running OSD.  This is pretty limiting,
and also runs the danger of not being able to turn the light off (if the
OSD then goes down).

2) Trigger the lights from any OSD (or possibly other daemon) that happens
to be running on the same host.  This probably covers most cases, but..
it's still a bit limited.  What if no OSDs are up?  What if there is only
one OSD on the host and it is down?

3) Delegate this to the new orchestrator.  Kube can just run this command
wherever we want.  Ansible presumably can too.

4) Depend on the libstoragemgmt network service.  lsmcli is just one part
of the suite... there's also a REST API that lets you do stuff.  There are
presumably certificates to configure and such to make it all work, though.

Also, there are some implementation oddities.  The source of truth for
the on/off state is the enclosure itself.  So if you turn the light off in
ceph, we need to be certain we turned it off on the device before we clear
out our state.  Maybe we have states like off, pending-on, on, pending-off,
and we don't transition from pending-foo to foo until we get a success
from the command that is supposed to toggle the light state.
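
The four-state idea could be sketched roughly like this in Python.  The
names (Led, request, complete) are hypothetical, and staying in the pending
state after a failed command is an assumption about what "don't transition
until we get a success" would mean in practice.

```python
from enum import Enum


class Led(Enum):
    OFF = "off"
    PENDING_ON = "pending-on"
    ON = "on"
    PENDING_OFF = "pending-off"


def request(state, want_on):
    """Record an operator request; only a pending state is entered here."""
    if want_on:
        return state if state is Led.ON else Led.PENDING_ON
    return state if state is Led.OFF else Led.PENDING_OFF


def complete(state, success):
    """Apply the result of the command that toggles the enclosure light."""
    if not success:
        return state  # assumption: stay pending so the command can be retried
    if state is Led.PENDING_ON:
        return Led.ON
    if state is Led.PENDING_OFF:
        return Led.OFF
    return state


if __name__ == "__main__":
    s = request(Led.OFF, want_on=True)
    s = complete(s, success=False)   # enclosure command failed
    s = complete(s, success=True)    # retry succeeded
    print(s.value)
```

With this shape, the persisted state only reaches on or off once the
enclosure has confirmed the change, so a light that failed to turn off
keeps showing up as pending-off rather than silently disappearing.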

Thoughts?  I think this is within striking distance (finally) and it would
be sweet to land it in nautilus!

sage

Fantastic!  This has certainly been a long time coming and helps make
basic operations like disk replacement practical...

ric