Re: blinking lights

On Wed, Oct 24, 2018 at 12:09 AM Sage Weil <sweil@xxxxxxxxxx> wrote:
>
> I gave the latest lsmcli (libstoragemgmt) another try and it can blink the
> HDD lights on my generic 2u supermicro boxes!  It was a bit of a hassle
> because ubuntu has an ancient version packaged, but once I built from
> source it can do 'ident' (blinky red light) or 'fault' (solid red light).
> Pretty simple!  And now is the time to harass the ubuntu/debian folks to
> get this into the next round of releases so we can take advantage of it
> (Fedora/RHEL/CentOS should already have a good version.)
>
> With the new device tracking that's coming in nautilus, I think we have
> most of the pieces to surface useful ceph controls to turn lights on and
> off.  For example,
>
> $ ceph device ls
> DEVICE                                  HOST:DEV      DAEMONS LIFE EXPECTANCY
> Crucial_CT1024M550SSD1_14160C164100     stud:sdd      osd.40  >6w
> Crucial_CT1024M550SSD1_14210C25B79E     eutow:sds     osd.19  >6w
>
> So we could add
>
> $ ceph device ident-on Crucial_CT1024M550SSD1_14160C164100
> $ ceph device fault-on Crucial_CT1024M550SSD1_14210C25B79E
> ...
> $ ceph device ident-off Crucial_CT1024M550SSD1_14160C164100
> $ ceph device fault-off Crucial_CT1024M550SSD1_14210C25B79E
>
> or perhaps
>
> $ ceph osd ident-on osd.123
> $ ceph osd fault-on osd.124
>
> (although note that OSDs may be backed by multiple devices, and you probably
> don't want to pull the shared db/journal device in most cases).

I think we need both:
 - the per-OSD commands would operate on the primary device for the
OSD, ignoring any shared db/wal devices.  This would be the "normal"
path for responding to OSD failures.
 - the per-device commands would only be used for shared db/wal
devices, for the (hopefully) rarer case of shared device failures.
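
To make the per-OSD path concrete: something like the sketch below
could pick the device(s) to light, given the OSD's metadata (python;
the keys mirror what `ceph osd metadata` reports for bluestore OSDs,
though treat the exact key names as assumptions):

    # Minimal sketch: pick an OSD's primary data device(s), skipping
    # shared db/wal devices.
    def primary_devices(osd_metadata):
        all_devs = set(osd_metadata.get('devices', '').split(','))
        shared = set()
        for key in ('bluefs_db_devices', 'bluefs_wal_devices'):
            shared |= set(osd_metadata.get(key, '').split(','))
        return sorted(d for d in all_devs - shared if d)

    # e.g. {'devices': 'sdb,sdd', 'bluefs_db_devices': 'sdb'} -> ['sdd']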

UI flows for both cases should start with the initial report of a
failure -- we already have that for individual OSD devices, but
perhaps we also need a new health check that identifies when all the
OSDs using a particular shared device have failed, and reports
suspected failure of the shared device.
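
Something like this rough sketch, say (the health code name and the
input shapes are invented for illustration):

    # If every OSD sharing a db/wal device is down, suspect the shared
    # device itself.
    def check_shared_devices(shared_dev_to_osds, down_osds):
        warnings = []
        for dev, osds in sorted(shared_dev_to_osds.items()):
            if osds and all(o in down_osds for o in osds):
                warnings.append('DEVICE_FAILURE_SUSPECTED: all %d OSDs '
                                'using shared device %s are down'
                                % (len(osds), dev))
        return warnings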

> My current thinking is that which lights should be on is persistently
> stored by Ceph, and raises a HEALTH_WARN (or HEALTH_INFO, nudge nudge)
> alert so that the operator knows that the light(s) are (still) on.
>
> How to run lsmcli
> -----------------
>
> We can pretty trivially invoke 'lsmcli local-disk-fault-led-off --path
> whatever' (or do something more minimal using the python bindings).  The
> gotcha is that we have to have something running on that host in order to
> do it.
>
> So, it would be pretty easy for an osd to ident its device(s) when it is
> up, but if it's not up, then... not so much.
>
> A few options:
>
> 1) Only do the ident/fault from a running OSD.  This is pretty limiting,
> and also runs the danger of not being able to turn the light off (if the
> OSD then goes down).
>
> 2) Trigger the lights from any OSD (or possibly other daemon) that happens
> to be running on the same host.  This probably covers most cases, but..
> it's still a bit limited.  What if no OSDs are up?  What if there is only
> one OSD on the host and it is down?
>
> 3) Delegate this to the new orchestrator.  Kube can just run this command
> wherever we want.  Ansible presumably can too.

I agree with Kai and Jan that this is the way to go.

Using the libstoragemgmt network service is still possible; it would
just be an implementation detail for the orchestrator itself.  I can
imagine that in some future container environments, deploying
something like the libstoragemgmt network service becomes quite
cheap/easy, and saves the effort of tools like Rook implementing their
own agent hooks -- but Ceph won't care.
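
For reference, whichever daemon or agent ends up executing the
request, the local python bindings reduce it to a couple of calls,
roughly as below (based on the lsm.LocalDisk API in libstoragemgmt;
worth verifying against the packaged version):

    # Equivalent of `lsmcli local-disk-fault-led-on --path /dev/sdd`
    # via the python bindings.
    from lsm import LocalDisk, LsmError

    def set_led(disk_path, led='ident', on=True):
        ops = {('ident', True):  LocalDisk.ident_led_on,
               ('ident', False): LocalDisk.ident_led_off,
               ('fault', True):  LocalDisk.fault_led_on,
               ('fault', False): LocalDisk.fault_led_off}
        try:
            ops[(led, on)](disk_path)
        except LsmError as e:
            # e.g. no LED/SES support behind this device
            raise RuntimeError('LED control failed for %s: %s'
                               % (disk_path, e))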

This would be an area where we need to get the orchestrator's device
names in line with Ceph's internal device naming -- that would be
useful anyway for other orchestrator functionality.
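
As a strawman for that mapping, a udev-based translation from an
orchestrator's /dev path to a Ceph-style device id could look like
this (illustrative only; Ceph's exact normalization may differ):

    # Derive a model_serial id (as in the `ceph device ls` output
    # quoted above) from a /dev path, using pyudev.
    import pyudev

    def ceph_style_device_id(dev_path):
        ctx = pyudev.Context()
        dev = pyudev.Devices.from_device_file(ctx, dev_path)
        model = dev.properties.get('ID_MODEL')
        serial = dev.properties.get('ID_SERIAL_SHORT')
        return '%s_%s' % (model, serial) if model and serial else None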

> 4) Depend on the libstoragemgmt network service.  lsmcli is just one part
> of the suite... there's also a REST API that lets you do stuff.  There are
> presumably certificates to configure and such to make it all work, though.
>
> Also, there are some implementation oddities.  The on/off state source
> of truth is the enclosure itself.  So if you turn the light off in Ceph,
> we need to be certain we turned it off with the device before we clear out
> our state.  Maybe we have states like off, pending-on, on, pending-off,
> and we don't transition from pending-foo to foo until we get a success
> from the command that is supposed to toggle the light state.

I'd be inclined to just make the command synchronous, and return an
error if the host is unreachable (perhaps with a special force flag to
clear out Ceph's state if the host is gone and never coming back).
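
In other words, something along these lines (pure sketch;
run_on_host() and led_state are hypothetical stand-ins for the
orchestrator call and for Ceph's persisted record of lit LEDs):

    class HostUnreachable(Exception):
        pass

    led_state = set()          # really persisted by the mgr; (devid, led)

    def run_on_host(host, cmd):
        raise NotImplementedError  # orchestrator-specific (Rook, ansible, ...)

    def device_ident_off(devid, host, dev_path, force=False):
        try:
            run_on_host(host, ['lsmcli', 'local-disk-ident-led-off',
                               '--path', dev_path])
        except HostUnreachable:
            if not force:
                raise          # surface the error; state stays in sync
            # --force: host is gone for good, just forget about the light
        led_state.discard((devid, 'ident'))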

John

>
> Thoughts?  I think this is within striking distance (finally) and it would
> be sweet to land it in nautilus!
>
> sage


