With respect to on/off commands potentially failing: what I've done in the
past is to just refresh the LED state (i.e., run the command to make the LED
match the recorded state) at a "low" frequency, where "low" is whatever is
appropriate for the system. As long as there is a single point of control
and truth, that has generally worked for me.

On Wed, Mar 6, 2019 at 8:50 AM Daniel Gryniewicz <dang@xxxxxxxxxx> wrote:
>
> (I've done some light blinking management in the past, and it is,
> indeed, complicated)
>
> On 3/5/19 2:39 PM, Sebastian Wagner wrote:
> > As far as I can see, nothing is going to work.
> >
> > The state stored in Ceph will probably not match reality:
> >
> > * Users may reboot machines without telling Ceph.
>
> If Ceph has a daemon running on that machine, we should be able to
> detect the reboot, yes?
>
> > * Users will start from scratch with a new ceph cluster
>
> This is a real issue, more below.
>
> > * Users will want to enable lights on disks not yet known to Ceph
>
> I don't think this is an issue; see below.
>
> > * Issuing on/off commands will fail.
>
> There's, unfortunately, not much we can do about this.
>
>
> My proposal, based on past experience, is that we declare a set of disks
> we "own" (presumably the set with OSDs on them), and actively manage the
> light state of those disks. We keep a list of the light state for each
> such disk, and whenever that list changes, we go through and actively
> set the light state of all the disks we own to either on or off. The
> light state in general isn't important, but only when a user actually
> wants to see a disk lit (or unlit), so I think we can get away with
> making sure the lights are correct when they change, possibly with a
> button that resets all lights to the correct state, in case someone
> changed lights behind our back.
>
> Daniel
>
> >
> > Especially as there is a high probability that devices are broken on day
> > one.
> >
> > I don't see any place (neither in Ceph nor in the orchestrator) where we
> > can store the state of LEDs reliably. I'd suggest that we take the list
> > of enabled lights simply as rough advice.
> >
> > Sebastian
> >
> >
> >
> > On 28.02.19 at 18:07, Brett Niver wrote:
> >> That seems different than reading the state of an LED, but rather
> >> tracking whether LEDs have been turned on or not. I.e. internal state -
> >> doesn't have to match actual diode state, just needs to be controlled
> >> centrally - one point of truth. Physically reading an LED
> >> isn't actually always reliable.
> >>
> >> On Thu, Feb 28, 2019 at 11:59 AM Sage Weil <sweil@xxxxxxxxxx> wrote:
> >>>
> >>> On Thu, 28 Feb 2019, Brett Niver wrote:
> >>>> Why do we care about state? At some level the code has reasons to
> >>>> want the LED to be either on or off...
> >>>
> >>> Mostly we don't need to care. I can think of a couple of problem
> >>> scenarios, though:
> >>>
> >>> - Someone out of band turns a light on. Then ceph turns on another light,
> >>> a human sees the first light, and pulls the wrong drive.
> >>>
> >>> - What if the host is down, but you want the health warning to go away?
> >>> There needs to be some 'force' option that will proceed to forget the
> >>> light was ever on when we can't reach the host, but that relies on a human
> >>> operator promising that the host really is off and thus the light won't
> >>> come back on.
> >>>
> >>> - We have some bug/race in our code that means we fail to turn off the
> >>> light before removing our notion that the light is on. Maybe an aborted
> >>> attempt to turn the light on has some slow request wandering through the
> >>> orchestrator queue of stuff to do and finally executes sometime after we
> >>> tell the system to turn the light back off?
> >>>
> >>> sage
> >>>
> >>>
> >>>
> >>>>
> >>>> On Thu, Feb 28, 2019 at 9:11 AM Sage Weil <sweil@xxxxxxxxxx> wrote:
> >>>>>
> >>>>> On Thu, 28 Feb 2019, Tim Serong wrote:
> >>>>>> On 02/28/2019 09:50 AM, Travis Nielsen wrote:
> >>>>>>> On Wed, Feb 27, 2019 at 3:42 PM Sage Weil <sweil@xxxxxxxxxx> wrote:
> >>>>>>>>
> >>>>>>>> On Wed, 27 Feb 2019, Travis Nielsen wrote:
> >>>>>>>>> Some questions and comments:
> >>>>>>>>> - What is the user interaction? Is the user specifying an OSD ID for
> >>>>>>>>> which they want to blink the light, or what is $PATH? If $PATH is a
> >>>>>>>>> device name such as /dev/sdb we would need to translate the OSD ID
> >>>>>>>>> to the device.
> >>>>>>>>
> >>>>>>>> Right now the module implements
> >>>>>>>>
> >>>>>>>>   ceph device {ident,fault}-light-{on,off} <devid>
> >>>>>>>>
> >>>>>>>> although once this is all working we can also add commands that operate on
> >>>>>>>> osd IDs.
> >>>>>>
> >>>>>> Presumably the OSD commands will just be implemented directly inside
> >>>>>> ceph-mgr (which can get OSD metadata to map IDs back to the relevant
> >>>>>> hostnames and device paths)? Or is there anything special an individual
> >>>>>> orchestrator might need to do for this case?
> >>>>>
> >>>>> Right, it'll just be a slightly more complicated command in the blinky
> >>>>> module (or wherever we move this code to later).
> >>>>>
> >>>>>>>>> - This feels like a "desired state" way of doing things since you want
> >>>>>>>>> a light on until you decide to turn it off. In this case, we could
> >>>>>>>>> create a CRD for desired state of device lights. CRDs are the way the
> >>>>>>>>> rook module should interact with the rook operator.
> >>>>>>>>>   - Whenever the CRD changes, rook would update the lights. When
> >>>>>>>>>   rook starts, it would also ensure the lights are set appropriately.
> >>>>>>>>>   - If a CRD is created it could mean the light should turn on for
> >>>>>>>>>   that device. If the CRD is deleted, the light should turn off.
> >>>>>>>>>   If there were different blinking modes, there could be a setting
> >>>>>>>>>   in the CRD to indicate such.
> >>>>>>>>
> >>>>>>>> That works. I was just thinking that since the mgr is already maintaining
> >>>>>>>> this set of desired-on lights we could keep the rook side of it simple.
> >>>>>>>>
> >>>>>>>
> >>>>>>> Ah, I missed that the mgr already stored this state. So if we can't
> >>>>>>> detect the actual state of the lights, this means the mgr is only
> >>>>>>> keeping track of the desire to turn the light on or off? And this
> >>>>>>> would translate to a health warning if a light should be on.
> >>>>>>>
> >>>>>>>>> - What does it take to detect the current state of the lights? Do we
> >>>>>>>>> run lsmcli on each node? If so, the discovery daemonset would make
> >>>>>>>>> sense to do this.
> >>>>>>>>
> >>>>>>>> If rook took the additional step of detecting lights that are on (due to
> >>>>>>>> external actors) that would make the whole thing a bit more robust, and be
> >>>>>>>> a good reason to bother with the complexity of a CRD. I don't see
> >>>>>>>> anything to get current status from the version I have on fedora 29,
> >>>>>>>> though.
> >>>>>>>>
> >>>>>>>>> If we didn't use a CRD, the rook module could store the settings in a
> >>>>>>>>> configmap, then run a k8s job itself to turn the lights on or off.
> >>>>>>>>> However, I'd say the CRDs are the more natural approach.
> >>>>>>>>
> >>>>>>>> If we can't detect the current state with current tools, I wonder if just
> >>>>>>>> having the mgr module schedule a one-off command to run lsmcli is
> >>>>>>>> simpler... does having rook store the state in a configmap or crd buy us
> >>>>>>>> anything?
> >>>>>>>>
> >>>>>>>
> >>>>>>> Right, if we can't detect the current state of the lights, rook can't
> >>>>>>> really manage the desired state and it may not make sense for rook to get
> >>>>>>> involved here.
> >>>>>>> The mgr module could easily run a k8s job directly to
> >>>>>>> turn the light on or off and we wouldn't worry about managing desired
> >>>>>>> state.
> >>>>>> I'd suggest the same is true for other orchestrators
> >>>>>> (ansible/deepsea/ssh). If we can't detect the state, we shouldn't do
> >>>>>> anything at the individual orchestrator level. (If we could detect
> >>>>>> state, we'd just want to pass it up to ceph-mgr, rather than having each
> >>>>>> individual module implement its own record of LED state)
> >>>>>
> >>>>> Right.
> >>>>>
> >>>>> sage
> >>>>
> >>>>
> >>>>
> >>
> >
>
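
For anyone reading this thread later, the approach described at the top
(one central record of desired light state, re-applied to every owned disk
whenever it changes and again at a low frequency) might look roughly like
the sketch below. It is only an illustration under stated assumptions, not
code from the blinky module: LedReconciler, own, set_desired, refresh_all
and the set_led callable are all hypothetical names, and set_led stands in
for whatever actually toggles the light on the host (for example a wrapper
around lsmcli or an orchestrator call).

```python
from typing import Callable, Dict, List

# Hypothetical hook: set_led(devid, on) returns True if the command succeeded.
SetLed = Callable[[str, bool], bool]


class LedReconciler:
    """Single point of truth for the desired light state (illustrative only)."""

    def __init__(self, set_led: SetLed) -> None:
        self._set_led = set_led
        # devid -> True if the light should be on. Only "owned" disks appear.
        self._desired: Dict[str, bool] = {}

    def own(self, devid: str) -> None:
        """Start managing a disk; its light defaults to off."""
        self._desired.setdefault(devid, False)

    def set_desired(self, devid: str, on: bool) -> List[str]:
        """Record the new desired state, then re-apply every owned light."""
        self._desired[devid] = on
        return self.refresh_all()

    def refresh_all(self) -> List[str]:
        """Push the desired state to every owned disk.

        Returns the devids whose command failed, so the caller can retry
        them on the next periodic refresh or raise a health warning.
        """
        failed: List[str] = []
        for devid, on in sorted(self._desired.items()):
            try:
                ok = self._set_led(devid, on)
            except Exception:
                # A failing command must not stop the remaining disks.
                ok = False
            if not ok:
                failed.append(devid)
        return failed
```

Calling refresh_all() from a timer covers both failed commands and lights
changed behind our back, without ever reading the physical LED state. It
does not address the "host is down" case Sage raises; that would still need
an explicit way to forget a light.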