Re: blinking lights via rook

Sage Weil <sweil@xxxxxxxxxx> · Wed, 27 Feb 2019 22:42:05 +0000 (UTC)

On Wed, 27 Feb 2019, Travis Nielsen wrote:
> Some questions and comments:
> - What is the user interaction? Is he specifying an OSD ID for which
> he wants to blink the light or what is $PATH? If $PATH is a device
> name such as /dev/sdb we would need to translate the OSD ID to the
> device.

Right now the module implements

  ceph device {ident,fault}-light-{on,off} <devid>

although once this is all working we can also add commands that operate on
OSD IDs.
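
For anyone unfamiliar with the brace shorthand, here is a minimal Python
sketch that spells out the four commands it stands for.  The helper name
light_commands and the placeholder device id are illustrative only; they
are not part of the module.

```python
from itertools import product

KINDS = ("ident", "fault")   # which LED
STATES = ("on", "off")       # what to do with it


def light_commands(devid):
    """Expand 'ceph device {ident,fault}-light-{on,off} <devid>'."""
    return [
        f"ceph device {kind}-light-{state} {devid}"
        for kind, state in product(KINDS, STATES)
    ]


if __name__ == "__main__":
    # VENDOR_MODEL_SERIAL is a placeholder, not a real device id.
    for cmd in light_commands("VENDOR_MODEL_SERIAL"):
        print(cmd)
```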

> - This feels like a "desired state" way of doing things since you want
> a light on until you decide to turn it off. In this case, we could
> create a CRD for desired state of device lights. CRDs are the way the
> rook module should interact with the rook operator.
>     - Whenever the CRD changes, rook would update the lights. When
> rook starts, it would also ensure the lights are set appropriately.
>     - If a CRD is created it could mean the light should turn on for
> that device. If the CRD is deleted, the light should turn off. If
> there were different blinking modes, there could be a setting in the
> CRD to indicate such.

That works.  I was just thinking that since the mgr is already maintaining 
this set of desired-on lights we could keep the rook side of it simple.  
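
To make "maintaining the set of desired-on lights" concrete, here is a
rough sketch of the idea, not the code in the PR.  The class name
LightState and the apply() callback are hypothetical stand-ins for
whatever actually toggles a light (the orchestrator call, a k8s job, etc.).

```python
class LightState:
    """Hypothetical holder for the set of lights that should be on."""

    def __init__(self, apply):
        # apply(devid, kind, on) is supplied by the caller and does the
        # real work; it is an assumption, not an existing mgr or rook API.
        self.apply = apply
        self.desired_on = set()  # {(devid, kind)}, kind is "ident" or "fault"

    def set_light(self, devid, kind, on):
        """Record the desired state, then push it out immediately."""
        key = (devid, kind)
        if on:
            self.desired_on.add(key)
        else:
            self.desired_on.discard(key)
        self.apply(devid, kind, on)

    def reapply(self):
        """After a restart, re-assert every light that should be on."""
        for devid, kind in sorted(self.desired_on):
            self.apply(devid, kind, True)
```

In a real module the desired_on set would have to be persisted somewhere
so that it survives the restart; this sketch leaves that part out.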

> - What does it take to detect the current state of the lights? Do we
> run lsmcli on each node? If so, the discovery daemonset would make
> sense to do this.

If rook took the additional step of detecting lights that are on (due to
external actors), that would make the whole thing a bit more robust, and be
a good reason to bother with the complexity of a CRD.  I don't see
any way to get the current status with the version I have on Fedora 29,
though.

> If we didn't use a CRD, the rook module could store the settings in a
> configmap, then run a k8s job itself to turn the lights on or off.
> However, I'd say the CRDs are the more natural approach.

If we can't detect the current state with current tools, I wonder if just
having the mgr module schedule a one-off command to run lsmcli is
simpler... does having rook store the state in a configmap or CRD buy us
anything?
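
For reference, the one-off command is small.  A minimal sketch of building
it, based on the lsmcli invocation quoted further down the thread; the
function name lsmcli_led_argv is made up for illustration, and how the
command gets scheduled on the right host is deliberately left out.

```python
def lsmcli_led_argv(kind, on, path):
    """Build the argv for turning a disk's ident or fault LED on or off.

    kind is "ident" or "fault"; path is the device path on the host
    (for example /dev/sdb).
    """
    if kind not in ("ident", "fault"):
        raise ValueError(f"unknown light kind: {kind!r}")
    state = "on" if on else "off"
    return ["lsmcli", f"local-disk-{kind}-led-{state}", "--path", path]


if __name__ == "__main__":
    print(" ".join(lsmcli_led_argv("fault", True, "/dev/sdb")))
```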

sage

> 
> Travis
> 
> On Wed, Feb 27, 2019 at 3:25 PM Gregory Farnum <gfarnum@xxxxxxxxxx> wrote:
> >
> > On Wed, Feb 27, 2019 at 1:16 PM Sage Weil <sweil@xxxxxxxxxx> wrote:
> > >
> > > See
> > >
> > >         https://github.com/ceph/ceph/pull/26684
> > >         https://pad.ceph.com/p/blinky-lights
> > >
> > > I think the hurdles are:
> > >
> > > - Add the appropriate hook to orchestrator_cli to turn a light on or off.
> > > Right now the code to remote() to the orchestrator is commented out in my
> > > PR.  The call sites have the device id (vendor/model/serial), host, and
> > > device name (e.g., sda).
> > >
> > > - Get a recentish libstoragemgmt into the rook container image, or some
> > > other container image we can schedule.
> > >
> > > - Either teach rook how to do a one-off "run this command on this host" to
> > > turn a light on or off, or teach the mgr rook module to schedule that
> > > command itself.  I'm not sure whether or not we want/need rook in the loop
> > > for turning these lights on or not... thoughts?  It seems like if rook
> > > does it, it needs a configmap (or something) to store the state of lights
> > > it wants on or off so it can reset them when it restarts.  The mgr module
> > > can (should?) do the exact same thing when the mgr restarts.
> >
> > This sounds like you need an interface for querying the state of
> > lights as well then? I presume the dashboard wants to show what lights
> > are on or off, not merely let admins push a button to change them...
> >
> > >
> > > For the record, the lsmcli command we ultimately need to run is
> > >
> > >  lsmcli local-disk-fault-led-on --path $PATH
> > >
> > > modulo s/fault/ident/ or s/on/off/.
> > >
> > > sage
> > >
> 
>