On 24.10.18 08:06, Jan Fajerski wrote:
> On Tue, Oct 23, 2018 at 11:08:57PM +0000, Sage Weil wrote:
>> I gave the latest lsmcli (libstoragemgmt) another try and it can blink
>> the HDD lights on my generic 2u supermicro boxes!  It was a bit of a
>> hassle because ubuntu has an ancient version packaged, but once I built
>> from source it can do 'ident' (blinky red light) or 'fault' (solid red
>> light).  Pretty simple!  And now is the time to harass the ubuntu/debian
>> folks to get this into the next round of releases so we can take
>> advantage of it (Fedora/RHEL/CentOS should already have a good version.)
>>
>> With the new device tracking that's coming in nautilus, I think we have
>> most of the pieces to surface useful ceph controls to turn lights on and
>> off.  For example,
>>
>>  $ ceph device ls
>>  DEVICE                               HOST:DEV   DAEMONS  LIFE EXPECTANCY
>>  Crucial_CT1024M550SSD1_14160C164100  stud:sdd   osd.40   >6w
>>  Crucial_CT1024M550SSD1_14210C25B79E  eutow:sds  osd.19   >6w
>>
>> So we could add
>>
>>  $ ceph device ident-on Crucial_CT1024M550SSD1_14160C164100
>>  $ ceph device fault-on Crucial_CT1024M550SSD1_14210C25B79E
>>  ...
>>  $ ceph device ident-off Crucial_CT1024M550SSD1_14160C164100
>>  $ ceph device fault-off Crucial_CT1024M550SSD1_14210C25B79E
>>
>> or perhaps
>>
>>  $ ceph osd ident-on osd.123
>>  $ ceph osd fault-on osd.124
>
> I'd prefer this. Maybe by default only the data device, with a flag to
> optionally blink the shared journal/db device?

I'm in favor of this as well.

>> (although note that osds may be backed by multiple devices, and you
>> probably don't want to pull the shared db/journal device in most cases).
>>
>> My current thinking is that which lights should be on is persistently
>> stored by Ceph, and raises a HEALTH_WARN (or HEALTH_INFO, nudge nudge)
>> alert so that the operator knows that the light(s) are (still) on.
>>
>> How to run lsmcli
>> -----------------
>>
>> We can pretty trivially invoke 'lsmcli local-disk-fault-led-off --path
>> whatever' (or do something more minimal using the python bindings).  The
>> gotcha is that we have to have something running on that host in order
>> to do it.
>>
>> So, it would be pretty easy for an osd to ident its device(s) when it is
>> up, but if it's not up, then... not so much.
>>
>> A few options:
>>
>> 1) Only do the ident/fault from a running OSD.  This is pretty limiting,
>> and also runs the danger of not being able to turn the light off (if the
>> OSD then goes down).
>>
>> 2) Trigger the lights from any OSD (or possibly other daemon) that
>> happens to be running on the same host.  This probably covers most
>> cases, but.. it's still a bit limited.  What if no OSDs are up?  What if
>> there is only one OSD on the host and it is down?
>>
>> 3) Delegate this to the new orchestrator.  Kube can just run this
>> command wherever we want.  Ansible presumably can too.
>
> Imho this is the way to go. DeepSea was actually about to start working
> on this, so great timing :)
> One other detail: while I'm sure libstorage is getting better with time,
> I'm equally sure there will always be hardware that does not play along.
> We were going to make the actual command configurable so users can drop
> in whatever they need for this. Going the operator route, this might not
> be ceph's concern anymore, just thought I'd mention it.

That makes absolute sense to me. We should think about a way to make the
command that is triggered at least configurable, so that someone who uses
a different type of controller is not left completely up in the air.

>> 4) Depend on the libstoragemgmt network service.  lsmcli is just one
>> part of the suite... there's also a REST API that lets you do stuff.
>> There are presumably certificates to configure and such to make it all
>> work, though.
>>
>> Also, there are some implementation oddities.
>> The on/off state source of truth is the enclosure itself.  So if you
>> turn the light off in ceph, we need to be certain we turned it off with
>> the device before we clear out our state.  Maybe we have states like
>> off, pending-on, on, pending-off, and we don't transition from
>> pending-foo to foo until we get a success from the command that is
>> supposed to toggle the light state.
>>
>> Thoughts?  I think this is within striking distance (finally) and it
>> would be sweet to land it in nautilus!
>>
>> sage
>>
>

-- 
SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton, HRB 21284 (AG Nürnberg)
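
For illustration, here is a minimal Python sketch of the off / pending-on / on / pending-off idea quoted above, combined with a configurable toggle command. It is only a sketch under stated assumptions, not how Ceph implements or will implement this: the names LedState, LedTracker and run_command are hypothetical, and the command lists are whatever the operator supplies. The tracker records the requested state first and only moves from pending-foo to foo once the toggle command reports success.

```python
import enum
import subprocess
from typing import Callable, Sequence


class LedState(enum.Enum):
    """States named in the thread; the class itself is hypothetical."""
    OFF = "off"
    PENDING_ON = "pending-on"
    ON = "on"
    PENDING_OFF = "pending-off"


def run_command(argv: Sequence[str]) -> bool:
    """Run a command and report whether it exited with status 0."""
    try:
        return subprocess.run(list(argv), check=False).returncode == 0
    except OSError:
        # Binary missing or not executable: treat as a failed toggle.
        return False


class LedTracker:
    """Tracks one light; on_cmd/off_cmd are operator-configurable (assumption)."""

    def __init__(self, on_cmd: Sequence[str], off_cmd: Sequence[str],
                 runner: Callable[[Sequence[str]], bool] = run_command):
        self.on_cmd = list(on_cmd)
        self.off_cmd = list(off_cmd)
        self.runner = runner
        self.state = LedState.OFF

    def request(self, want_on: bool) -> LedState:
        # Record the intent before touching the hardware.
        self.state = LedState.PENDING_ON if want_on else LedState.PENDING_OFF
        return self.retry()

    def retry(self) -> LedState:
        # Leave a pending state only after the toggle command succeeded.
        if self.state is LedState.PENDING_ON and self.runner(self.on_cmd):
            self.state = LedState.ON
        elif self.state is LedState.PENDING_OFF and self.runner(self.off_cmd):
            self.state = LedState.OFF
        return self.state
```

With lsmcli, on_cmd could be something like ["lsmcli", "local-disk-fault-led-on", "--path", "/dev/sdX"] and off_cmd the matching local-disk-fault-led-off call, with /dev/sdX standing in for the real device path. A light stuck in a pending state can be retried later, for example once a daemon on that host is reachable again.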