Re: blinking lights

Brett Niver <bniver@xxxxxxxxxx> · Wed, 24 Oct 2018 07:37:13 -0400

Which HDD lights?  We had dedicated marker LEDs in Symm, but I was unsure
whether there is a standard for xDDs that would allow us to uniquely
"mark" them.

On Tue, Oct 23, 2018 at 7:08 PM, Sage Weil <sweil@xxxxxxxxxx> wrote:
> I gave the latest lsmcli (libstoragemgmt) another try and it can blink the
> HDD lights on my generic 2U Supermicro boxes!  It was a bit of a hassle
> because Ubuntu has an ancient version packaged, but once I built from
> source it can do 'ident' (blinky red light) or 'fault' (solid red light).
> Pretty simple!  And now is the time to harass the Ubuntu/Debian folks to
> get this into the next round of releases so we can take advantage of it.
> (Fedora/RHEL/CentOS should already have a good version.)
>
> With the new device tracking that's coming in nautilus, I think we have
> most of the pieces to surface useful ceph controls to turn lights on and
> off.  For example,
>
> $ ceph device ls
> DEVICE                                  HOST:DEV      DAEMONS LIFE EXPECTANCY
> Crucial_CT1024M550SSD1_14160C164100     stud:sdd      osd.40  >6w
> Crucial_CT1024M550SSD1_14210C25B79E     eutow:sds     osd.19  >6w
>
> So we could add
>
> $ ceph device ident-on Crucial_CT1024M550SSD1_14160C164100
> $ ceph device fault-on Crucial_CT1024M550SSD1_14210C25B79E
> ...
> $ ceph device ident-off Crucial_CT1024M550SSD1_14160C164100
> $ ceph device fault-off Crucial_CT1024M550SSD1_14210C25B79E
>
> or perhaps
>
> $ ceph osd ident-on osd.123
> $ ceph osd fault-on osd.124
>
> (although note that OSDs may be backed by multiple devices, and you probably
> don't want to pull the shared db/journal device in most cases).
>
> My current thinking is that Ceph persistently stores which lights should
> be on, and raises a HEALTH_WARN (or HEALTH_INFO, nudge nudge) alert so
> that the operator knows that the light(s) are (still) on.
>
> How to run lsmcli
> -----------------
>
> We can pretty trivially invoke 'lsmcli local-disk-fault-led-off --path
> whatever' (or do something more minimal using the python bindings).  The
> gotcha is that we have to have something running on that host in order to
> do it.
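For the "invoke lsmcli on the host" piece, here is a minimal sketch of a wrapper. It assumes the four subcommands follow the `local-disk-<kind>-led-<on|off>` pattern of the one quoted above; `led_command` and `set_led` are hypothetical helper names, not anything that exists in Ceph or libstoragemgmt.

```python
import subprocess

# Assumption: lsmcli names its LED subcommands local-disk-<kind>-led-<on|off>,
# matching the 'local-disk-fault-led-off' example quoted above.
LED_KINDS = ("ident", "fault")


def led_command(kind, on, dev_path):
    """Build the lsmcli argument list for one LED change (hypothetical helper)."""
    if kind not in LED_KINDS:
        raise ValueError("unknown LED kind: %r" % (kind,))
    state = "on" if on else "off"
    return ["lsmcli", "local-disk-%s-led-%s" % (kind, state), "--path", dev_path]


def set_led(kind, on, dev_path, runner=subprocess.run):
    """Run the command on the local host; True only if lsmcli exited with 0."""
    result = runner(led_command(kind, on, dev_path),
                    capture_output=True, text=True)
    return result.returncode == 0
```

The `runner` argument is only there so the command can be exercised without real hardware. Whatever ends up calling this still has to be running on the host that owns the disk.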
>
> So, it would be pretty easy for an osd to ident its device(s) when it is
> up, but if it's not up, then... not so much.
>
> A few options:
>
> 1) Only do the ident/fault from a running OSD.  This is pretty limiting,
> and also runs the danger of not being able to turn the light off (if the
> OSD then goes down).
>
> 2) Trigger the lights from any OSD (or possibly other daemon) that happens
> to be running on the same host.  This probably covers most cases, but
> it's still a bit limited.  What if no OSDs are up?  What if there is only
> one OSD on the host and it is down?
>
> 3) Delegate this to the new orchestrator.  Kube can just run this command
> wherever we want.  Ansible presumably can too.
>
> 4) Depend on the libstoragemgmt network service.  lsmcli is just one part
> of the suite... there's also a REST API that lets you do stuff.  There are
> presumably certificates to configure and such to make it all work, though.
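To make the gap in option 2 concrete, a rough sketch of "pick any daemon that is up on the device's host". The `(name, host, is_up)` tuples and the `pick_local_daemon` name are made up for illustration; the real daemon map would look different.

```python
def pick_local_daemon(target_host, daemons):
    """Return the name of some daemon that is up on target_host, or None.

    daemons is an iterable of (name, host, is_up) tuples; this shape is an
    assumption for the sketch, not how Ceph actually stores it.
    """
    for name, host, is_up in daemons:
        if host == target_host and is_up:
            return name
    # Nothing is running on that host, so nobody can toggle the light.
    return None
```

The `None` case is exactly the "only one OSD on the host and it is down" situation, which is where option 3 or 4 would have to take over.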
>
> Also, there are some implementation oddities.  The source of truth for
> the on/off state is the enclosure itself.  So if you turn the light off in
> ceph, we need to be certain it was actually turned off on the device before
> we clear out our state.  Maybe we have states like off, pending-on, on,
> pending-off, and we don't transition from pending-foo to foo until we get
> a success from the command that is supposed to toggle the light state.
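The four states could be sketched roughly like this. The function names are hypothetical, and treating "anything not confirmed off" as alert-worthy is my assumption about how the health warning would tie in.

```python
from enum import Enum


class LedState(Enum):
    OFF = "off"
    PENDING_ON = "pending-on"
    ON = "on"
    PENDING_OFF = "pending-off"


def request(state, want_on):
    """Operator asks for the light on or off; returns the state to persist."""
    if want_on:
        return LedState.ON if state is LedState.ON else LedState.PENDING_ON
    return LedState.OFF if state is LedState.OFF else LedState.PENDING_OFF


def command_result(state, success):
    """Fold in the toggle command's result; pending-foo settles only on success."""
    if not success:
        return state  # stay pending so the command can be retried
    if state is LedState.PENDING_ON:
        return LedState.ON
    if state is LedState.PENDING_OFF:
        return LedState.OFF
    return state


def needs_health_alert(state):
    """Assumption: warn for anything that is not confirmed off."""
    return state is not LedState.OFF
```

With this shape a failed "off" leaves the device in pending-off, so the alert stays raised until the enclosure confirms the light is really off.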
>
> Thoughts?  I think this is within striking distance (finally) and it would
> be sweet to land it in nautilus!
>
> sage