Which HDD lights? We had dedicated marker LEDs in Symm, but I was unsure
whether there is a standard for xDDs that would allow us to uniquely "mark"
them.

On Tue, Oct 23, 2018 at 7:08 PM, Sage Weil <sweil@xxxxxxxxxx> wrote:
> I gave the latest lsmcli (libstoragemgmt) another try and it can blink the
> HDD lights on my generic 2u supermicro boxes!  It was a bit of a hassle
> because ubuntu has an ancient version packaged, but once I built from
> source it can do 'ident' (blinky red light) or 'fault' (solid red light).
> Pretty simple!  And now is the time to harass the ubuntu/debian folks to
> get this into the next round of releases so we can take advantage of it
> (Fedora/RHEL/CentOS should already have a good version.)
>
> With the new device tracking that's coming in nautilus, I think we have
> most of the pieces to surface useful ceph controls to turn lights on and
> off.  For example,
>
>  $ ceph device ls
>  DEVICE                               HOST:DEV   DAEMONS  LIFE EXPECTANCY
>  Crucial_CT1024M550SSD1_14160C164100  stud:sdd   osd.40   >6w
>  Crucial_CT1024M550SSD1_14210C25B79E  eutow:sds  osd.19   >6w
>
> So we could add
>
>  $ ceph device ident-on Crucial_CT1024M550SSD1_14160C164100
>  $ ceph device fault-on Crucial_CT1024M550SSD1_14210C25B79E
>  ...
>  $ ceph device ident-off Crucial_CT1024M550SSD1_14160C164100
>  $ ceph device fault-off Crucial_CT1024M550SSD1_14210C25B79E
>
> or perhaps
>
>  $ ceph osd ident-on osd.123
>  $ ceph osd fault-on osd.124
>
> (although note that OSDs may be backed by multiple devices, and you
> probably don't want to pull the shared db/journal device in most cases).
>
> My current thinking is that which lights should be on is persistently
> stored by Ceph, and raises a HEALTH_WARN (or HEALTH_INFO, nudge nudge)
> alert so that the operator knows that the light(s) are (still) on.
>
> How to run lsmcli
> -----------------
>
> We can pretty trivially invoke 'lsmcli local-disk-fault-led-off --path
> whatever' (or do something more minimal using the python bindings).
> The gotcha is that we have to have something running on that host in
> order to do it.
>
> So, it would be pretty easy for an osd to ident its device(s) when it is
> up, but if it's not up, then... not so much.
>
> A few options:
>
> 1) Only do the ident/fault from a running OSD.  This is pretty limiting,
> and also runs the danger of not being able to turn the light off (if the
> OSD then goes down).
>
> 2) Trigger the lights from any OSD (or possibly other daemon) that happens
> to be running on the same host.  This probably covers most cases, but...
> it's still a bit limited.  What if no OSDs are up?  What if there is only
> one OSD on the host and it is down?
>
> 3) Delegate this to the new orchestrator.  Kube can just run this command
> wherever we want.  Ansible presumably can too.
>
> 4) Depend on the libstoragemgmt network service.  lsmcli is just one part
> of the suite... there's also a REST API that lets you do stuff.  There are
> presumably certificates to configure and such to make it all work, though.
>
> Also, there are some implementation oddities.  The source of truth for
> the on/off state is the enclosure itself.  So if you turn the light off in
> ceph, we need to be certain we turned it off with the device before we
> clear out our state.  Maybe we have states like off, pending-on, on,
> pending-off, and we don't transition from pending-foo to foo until we get
> a success from the command that is supposed to toggle the light state.
>
> Thoughts?  I think this is within striking distance (finally) and it would
> be sweet to land it in nautilus!
>
> sage
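
For what it's worth, the off / pending-on / on / pending-off idea from the
quoted message could look roughly like the sketch below. It is only an
illustration, not existing Ceph code: DeviceLed, LedState, lsmcli_argv,
run_lsmcli and needs_health_warning are all hypothetical names. The command
line follows the 'lsmcli local-disk-fault-led-off --path' form quoted above,
and the on/ident variants are assumed to follow the same naming pattern. The
runner is pluggable because whoever actually executes the command on the host
(an OSD, another daemon, the orchestrator) is still an open question.

```python
import subprocess
from enum import Enum


class LedState(Enum):
    OFF = "off"
    PENDING_ON = "pending-on"
    ON = "on"
    PENDING_OFF = "pending-off"


def lsmcli_argv(kind, on, path):
    """Build e.g. ['lsmcli', 'local-disk-fault-led-off', '--path', path]."""
    if kind not in ("ident", "fault"):
        raise ValueError("kind must be 'ident' or 'fault'")
    action = "on" if on else "off"
    return ["lsmcli", f"local-disk-{kind}-led-{action}", "--path", path]


def run_lsmcli(argv):
    """Run the command on the local host; True only on exit status 0."""
    try:
        result = subprocess.run(argv, capture_output=True, timeout=30)
    except (OSError, subprocess.TimeoutExpired):
        return False
    return result.returncode == 0


class DeviceLed:
    """Hypothetical per-device record of the desired and confirmed LED state."""

    def __init__(self, path, kind="ident", runner=run_lsmcli):
        self.path = path
        self.kind = kind
        self.runner = runner
        self.state = LedState.OFF

    def request(self, on):
        # Record the intent first, then try to apply it to the enclosure.
        self.state = LedState.PENDING_ON if on else LedState.PENDING_OFF
        return self.retry()

    def retry(self):
        # Leave pending-foo for foo only once the toggle command succeeded.
        if self.state in (LedState.ON, LedState.OFF):
            return self.state
        want_on = self.state is LedState.PENDING_ON
        if self.runner(lsmcli_argv(self.kind, want_on, self.path)):
            self.state = LedState.ON if want_on else LedState.OFF
        return self.state

    def needs_health_warning(self):
        # Anything other than a confirmed "off" may mean a light is still on.
        return self.state is not LedState.OFF
```

With this shape, a failed ident-off leaves the device in pending-off, so the
health alert stays raised and the toggle can be retried later, for example
once some daemon on that host is reachable again.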