Re: blinking lights via rook

Brett Niver <bniver@xxxxxxxxxx> · Wed, 6 Mar 2019 14:04:13 -0500

With respect to on/off commands potentially failing, what I've done in
the past is to just refresh the LED state (i.e., rerun the command so the
LED matches the recorded state) at a "low" frequency, where "low" is
appropriate for the system.

As long as there is a single point of control and truth, that has
generally worked for me.
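
A minimal sketch of that refresh approach, in Python. Everything named
here is an assumption for illustration: `LedReconciler` and the injected
`apply_led(devid, on)` callable are hypothetical and are not part of Ceph
or any orchestrator. In a real deployment the callable would wrap whatever
actually drives the light (for example the lsmcli invocation discussed
further down the thread). The desired state lives in one place, every
refresh re-applies all of it, and a failed command is simply retried on
the next pass.

```python
import threading
from typing import Callable, Dict, List


class LedReconciler:
    """Holds the desired LED state in one place and re-applies it (sketch)."""

    def __init__(self, apply_led: Callable[[str, bool], None]) -> None:
        # apply_led is a hypothetical hook that turns one device's light
        # on or off and raises an exception if the command fails.
        self._apply_led = apply_led
        self._desired: Dict[str, bool] = {}
        self._lock = threading.Lock()

    def set_desired(self, devid: str, on: bool) -> List[str]:
        """Record the desired state, then push the whole list out."""
        with self._lock:
            self._desired[devid] = on
        return self.reconcile()

    def reconcile(self) -> List[str]:
        """Re-apply every recorded state; return devices whose command failed."""
        with self._lock:
            snapshot = dict(self._desired)
        failed: List[str] = []
        for devid, on in sorted(snapshot.items()):
            try:
                self._apply_led(devid, on)
            except Exception:
                # Nothing else to do here: the next refresh retries it.
                failed.append(devid)
        return failed

    def run(self, interval_s: float, stop: threading.Event) -> None:
        """Refresh at a low frequency until stop is set."""
        while not stop.wait(interval_s):
            self.reconcile()
```

Running `run()` in a background thread with an interval of a few minutes
gives the "low" frequency refresh. Calling `reconcile()` by hand is the
"reset all lights" button Daniel mentions below.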

On Wed, Mar 6, 2019 at 8:50 AM Daniel Gryniewicz <dang@xxxxxxxxxx> wrote:
>
> (I've done some light blinking management in the past, and it is,
> indeed, complicated)
>
> On 3/5/19 2:39 PM, Sebastian Wagner wrote:
> > As far as I can see, nothing is going to work.
> >
> > The state stored in Ceph will probably not match with the reality:
> >
> > * Users may reboot machines without telling Ceph.
>
> If Ceph has a daemon running on that machine, we should be able to
> detect the reboot, yes?
>
> > * Users will start from scratch with a new ceph cluster
>
> This is a real issue, more below.
>
> > * Users will want to enable lights on disks not yet known to Ceph
>
> I don't think this is an issue; see below.
>
> > * Issuing on/off commands will fail.
>
> There's, unfortunately, not much we can do about this.
>
>
> My proposal, based on past experience, is that we declare a set of disks
> we "own" (presumably the set with OSDs on them), and actively manage the
> light state of those disks.  We keep a list of the light state for each
> such disk, and whenever that list changes, we go through and actively
> set the light state of all the disks we own to either on or off.  The
> light state in general isn't important, but only when a user actually
> wants to see a disk lit (or unlit), so I think we can get away with
> making sure the lights are correct when they change, possibly with a
> button that resets all lights to the correct state, in case someone
> changed lights behind our back.
>
> Daniel
>
> >
> > Especially as there is a high probability that devices are broken on day
> > one.
> >
> > I don't see any place (neither in Ceph nor in the orchestrator) where we
> > can store the state of LEDs reliably. I'd suggest that we take the list
> > of enabled lights simply as rough advice.
> >
> > Sebastian
> >
> >
> >
> > Am 28.02.19 um 18:07 schrieb Brett Niver:
> >> That seems different from reading the state of an LED; it is rather
> >> tracking whether LEDs have been turned on or not.  I.e., internal state
> >> doesn't have to match actual diode state, it just needs to be controlled
> >> centrally: one point of truth.  Physically reading an LED
> >> isn't always reliable.
> >>
> >> On Thu, Feb 28, 2019 at 11:59 AM Sage Weil <sweil@xxxxxxxxxx> wrote:
> >>>
> >>> On Thu, 28 Feb 2019, Brett Niver wrote:
> >>>> Why do we care about state?  At some level the code has reasons to
> >>>> want the LED to be either on or off...
> >>>
> >>> Mostly we don't need to care.  I can think of a couple problem
> >>> scenarios, though:
> >>>
> >>> - Someone out of band turns a light on.  Then ceph turns on another light,
> >>> a human sees the first light, and pulls the wrong drive.
> >>>
> >>> - What if the host is down, but you want the health warning to go away?
> >>> There needs to be some 'force' option that will proceed to forget the
> >>> light was ever on when we can't reach the host, but that relies on a human
> >>> operator promising that the host really is off and thus the light won't
> >>> come back on.
> >>>
> >>> - We have some bug/race in our code that means we fail to turn off the
> >>> light before removing our notion that the light is on.  Maybe an aborted
> >>> attempt to turn the light on has some slow request wandering through the
> >>> orchestrator queue of stuff to do and finally executes sometime after we
> >>> tell the system to turn the light back off?
> >>>
> >>> sage
> >>>
> >>>
> >>>
> >>>>
> >>>> On Thu, Feb 28, 2019 at 9:11 AM Sage Weil <sweil@xxxxxxxxxx> wrote:
> >>>>>
> >>>>> On Thu, 28 Feb 2019, Tim Serong wrote:
> >>>>>> On 02/28/2019 09:50 AM, Travis Nielsen wrote:
> >>>>>>> On Wed, Feb 27, 2019 at 3:42 PM Sage Weil <sweil@xxxxxxxxxx> wrote:
> >>>>>>>>
> >>>>>>>> On Wed, 27 Feb 2019, Travis Nielsen wrote:
> >>>>>>>>> Some questions and comments:
> >>>>>>>>> - What is the user interaction? Is he specifying an OSD ID for which
> >>>>>>>>> he wants to blink the light or what is $PATH? If $PATH is a device
> >>>>>>>>> name such as /dev/sdb we would need to translate the OSD ID to the
> >>>>>>>>> device.
> >>>>>>>>
> >>>>>>>> Right now the module implements
> >>>>>>>>
> >>>>>>>>    ceph device {ident,fault}-light-{on,off} <devid>
> >>>>>>>>
> >>>>>>>> although once this is all working we can also add commands that operate on
> >>>>>>>> osd IDs.
> >>>>>>
> >>>>>> Presumably the OSD commands will just be implemented directly inside
> >>>>>> ceph-mgr (which can get OSD metadata to map IDs back to the relevant
> >>>>>> hostnames and device paths)?  Or is there anything special an individual
> >>>>>> orchestrator might need to do for this case?
> >>>>>
> >>>>> Right, it'll just be a slightly more complicated command in the blinky
> >>>>> module (or wherever we move this code to later).
> >>>>>
> >>>>>>>>> - This feels like a "desired state" way of doing things since you want
> >>>>>>>>> a light on until you decide to turn it off. In this case, we could
> >>>>>>>>> create a CRD for desired state of device lights. CRDs are the way the
> >>>>>>>>> rook module should interact with the rook operator.
> >>>>>>>>>      - Whenever the CRD changes, rook would update the lights. When
> >>>>>>>>> rook starts, it would also ensure the lights are set appropriately.
> >>>>>>>>>      - If a CRD is created it could mean the light should turn on for
> >>>>>>>>> that device. If the CRD is deleted, the light should turn off. If
> >>>>>>>>> there were different blinking modes, there could be a setting in the
> >>>>>>>>> CRD to indicate such.
> >>>>>>>>
> >>>>>>>> That works.  I was just thinking that since the mgr is already maintaining
> >>>>>>>> this set of desired-on lights we could keep the rook side of it simple.
> >>>>>>>>
> >>>>>>>
> >>>>>>> Ah, I missed that the mgr already stored this state. So if we can't
> >>>>>>> detect the actual state of the lights, this means the mgr is only
> >>>>>>> keeping track of the desire to turn the light on or off? And this
> >>>>>>> would translate to a health warning if a light should be on.
> >>>>>>>
> >>>>>>>>> - What does it take to detect the current state of the lights? Do we
> >>>>>>>>> run lsmcli on each node? If so, the discovery daemonset would make
> >>>>>>>>> sense to do this.
> >>>>>>>>
> >>>>>>>> If rook took the additional step of detecting lights that are on (due to
> >>>>>>>> external actors) that would make the whole thing a bit more robust, and be
> >>>>>>>> a good reason to bother with the complexity of a CRD.  I don't see
> >>>>>>>> anything to get current status from the version I have on fedora 29,
> >>>>>>>> though.
> >>>>>>>>
> >>>>>>>>> If we didn't use a CRD, the rook module could store the settings in a
> >>>>>>>>> configmap, then run a k8s job itself to turn the lights on or off.
> >>>>>>>>> However, I'd say the CRDs are the more natural approach.
> >>>>>>>>
> >>>>>>>> If we can't detect the current state with current tools, I wonder if just
> >>>>>>>> having the mgr module schedule a one-off command to run lsmcli is
> >>>>>>>> simpler... does having rook store the state in a configmap or crd buy us
> >>>>>>>> anything?
> >>>>>>>>
> >>>>>>>
> >>>>>>> Right, if we can't detect the current state of the lights, rook can't
> >>>>>>> really manage the desired state, and it may not make sense for rook to get
> >>>>>>> involved here. The mgr module could easily run a k8s job directly to
> >>>>>>> turn the light on or off and we wouldn't worry about managing desired
> >>>>>>> state.
> >>>>>> I'd suggest the same is true for other orchestrators
> >>>>>> (ansible/deepsea/ssh).  If we can't detect the state, we shouldn't do
> >>>>>> anything at the individual orchestrator level.  (If we could detect
> >>>>>> state, we'd just want to pass it up to ceph-mgr, rather than having each
> >>>>>> individual module implement its own record of LED state)
> >>>>>
> >>>>> Right.
> >>>>>
> >>>>> sage
> >>>>
> >>>>
> >>>>
> >>
> >
>