Re: blinking lights


 



On 24.10.18 08:06, Jan Fajerski wrote:
> On Tue, Oct 23, 2018 at 11:08:57PM +0000, Sage Weil wrote:
>> I gave the latest lsmcli (libstoragemgmt) another try and it can blink the
>> HDD lights on my generic 2u supermicro boxes!  It was a bit of a hassle
>> because ubuntu has an ancient version packaged, but once I built from
>> source it can do 'ident' (blinky red light) or 'fault' (solid red light).
>> Pretty simple!  And now is the time to harass the ubuntu/debian folks to
>> get this into the next round of releases so we can take advantage of it
>> (Fedora/RHEL/CentOS should already have a good version.)
>>
>> With the new device tracking that's coming in nautilus, I think we have
>> most of the pieces to surface useful ceph controls to turn lights on and
>> off.  For example,
>>
>> $ ceph device ls
>> DEVICE                                  HOST:DEV      DAEMONS LIFE EXPECTANCY
>> Crucial_CT1024M550SSD1_14160C164100     stud:sdd      osd.40  >6w
>> Crucial_CT1024M550SSD1_14210C25B79E     eutow:sds     osd.19  >6w
>>
>> So we could add
>>
>> $ ceph device ident-on Crucial_CT1024M550SSD1_14160C164100
>> $ ceph device fault-on Crucial_CT1024M550SSD1_14210C25B79E
>> ...
>> $ ceph device ident-off Crucial_CT1024M550SSD1_14160C164100
>> $ ceph device fault-off Crucial_CT1024M550SSD1_14210C25B79E
>>
>> or perhaps
>>
>> $ ceph osd ident-on osd.123
>> $ ceph osd fault-on osd.124
> I'd prefer this. Maybe by default only the data device, with a flag to
> optionally blink the shared journal/db device?
I'm in favor of this as well.
>> (although note that OSDs may be backed by multiple devices, and you
>> probably don't want to pull the shared db/journal device in most cases).
>>
>> My current thinking is that which lights should be on is persistently
>> stored by Ceph, and raises a HEALTH_WARN (or HEALTH_INFO, nudge nudge)
>> alert so that the operator knows that the light(s) are (still) on.
>>
>> How to run lsmcli
>> -----------------
>>
>> We can pretty trivially invoke 'lsmcli local-disk-fault-led-off --path
>> whatever' (or do something more minimal using the python bindings).  The
>> gotcha is that we have to have something running on that host in order to
>> do it.
>>
>> So, it would be pretty easy for an osd to ident its device(s) when it is
>> up, but if it's not up, then... not so much.
>>
>> A few options:
>>
>> 1) Only do the ident/fault from a running OSD.  This is pretty limiting,
>> and also runs the danger of not being able to turn the light off (if the
>> OSD then goes down).
>>
>> 2) Trigger the lights from any OSD (or possibly other daemon) that happens
>> to be running on the same host.  This probably covers most cases, but
>> it's still a bit limited.  What if no OSDs are up?  What if there is only
>> one OSD on the host and it is down?
>>
>> 3) Delegate this to the new orchestrator.  Kube can just run this command
>> wherever we want.  Ansible presumably can too.
> Imho this is the way to go. DeepSea was actually about to start
> working on this, so great timing :)
> One other detail: while I'm sure libstoragemgmt is getting better with
> time, I'm equally sure there will always be hardware that does not
> play along. We were going to make the actual command configurable so
> users can drop in whatever they need for this. Going the operator
> route, this might not be ceph's concern anymore, just thought I'd
> mention it.
That makes absolute sense to me. We should think about a way to make the
command that is triggered at least configurable, so that someone using a
different type of controller is not left completely up in the air.
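
A minimal sketch of what such a configurable command could look like, in
Python. The template string, its placeholder names ({led}, {state}, {path}),
and both function names are made up for illustration; only the lsmcli
local-disk-*-led-* subcommands and the --path option come from the tool
itself. Nothing here is an existing Ceph or DeepSea interface.

```python
import shlex
import subprocess

# Hypothetical default; an operator could override it with their own tool.
DEFAULT_TEMPLATE = "lsmcli local-disk-{led}-led-{state} --path {path}"


def build_led_command(path, led="ident", state="on", template=DEFAULT_TEMPLATE):
    """Expand a command template into an argv list (no shell involved)."""
    if led not in ("ident", "fault"):
        raise ValueError("led must be 'ident' or 'fault'")
    if state not in ("on", "off"):
        raise ValueError("state must be 'on' or 'off'")
    # Split first, then substitute, so a path containing spaces stays one argument.
    return [arg.format(led=led, state=state, path=path)
            for arg in shlex.split(template)]


def set_led(path, led="ident", state="on", template=DEFAULT_TEMPLATE):
    """Run the command on the local host; return True only on exit status 0."""
    argv = build_led_command(path, led, state, template)
    try:
        return subprocess.run(argv, check=False).returncode == 0
    except OSError:
        # The configured binary is missing or not executable.
        return False
```

With a site-specific template such as "/usr/local/bin/my-led-tool {led}
{state} {path}" (a hypothetical tool), the same two functions would work
unchanged, which is the point of keeping the command out of the code.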
>>
>> 4) Depend on the libstoragemgmt network service.  lsmcli is just one part
>> of the suite... there's also a REST API that lets you do stuff.  There are
>> presumably certificates to configure and such to make it all work, though.
>>
>> Also, there are some implementation oddities.  The on/off state source
>> of truth is the enclosure itself.  So if you turn the light off in ceph,
>> we need to be certain we turned it off on the device before we clear out
>> our state.  Maybe we have states like off, pending-on, on, pending-off,
>> and we don't transition from pending-foo to foo until we get a success
>> from the command that is supposed to toggle the light state.
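
To make the pending-foo idea above concrete, here is a rough Python sketch of
the four states. The class and method names are invented for illustration,
and `toggle` stands in for whatever actually flips the light (lsmcli, an
orchestrator call, ...); it is assumed to return True on success. This is
one possible reading of the proposal, not an existing Ceph implementation.

```python
from enum import Enum


class LedState(Enum):
    OFF = "off"
    PENDING_ON = "pending-on"
    ON = "on"
    PENDING_OFF = "pending-off"


class LedTracker:
    """Tracks the desired light state; only confirms it after a successful toggle."""

    def __init__(self, toggle):
        # toggle: callable(want_on: bool) -> bool, True if the enclosure accepted it
        self.toggle = toggle
        self.state = LedState.OFF

    def request(self, want_on):
        """Record the operator's intent first, then try to apply it."""
        self.state = LedState.PENDING_ON if want_on else LedState.PENDING_OFF
        return self.retry()

    def retry(self):
        """Attempt the pending transition; stay pending if the command fails."""
        if self.state is LedState.PENDING_ON:
            if self.toggle(True):
                self.state = LedState.ON
        elif self.state is LedState.PENDING_OFF:
            if self.toggle(False):
                self.state = LedState.OFF
        return self.state
```

A failed command leaves the tracker in pending-on or pending-off, so a later
retry() (say, once a daemon on that host is reachable again) can finish the
job without losing track of what the operator asked for.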
>>
>> Thoughts?  I think this is within striking distance (finally) and it would
>> be sweet to land it in nautilus!
>>
>> sage
>>
>
-- 
SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton, HRB 21284 (AG Nürnberg)

