Re: How does monitor know OSD is dead?

"Brian :" <brians@xxxxxxxx> · Wed, 3 Jul 2019 00:30:37 +0100

I wouldn't say that's a pretty common failure. The flaw here is perhaps the design of the cluster, which relied on a single power source. Power sources fail. Dual power supplies connected to separate A and B power feeds in the data centre are pretty standard.

On Tuesday, July 2, 2019, Bryan Henderson <bryanh@xxxxxxxxxxxxxxxx> wrote:
>> Normally in the case of a restart then somebody who used to have a
>> connection to the OSD would still be running and flag it as dead. But
>> if *all* the daemons in the cluster lose their soft state, that can't
>> happen.
>
> OK, thanks.  I guess that explains it.  But that's a pretty serious design
> flaw, isn't it?  What I experienced is a pretty common failure mode: a power
> outage caused the entire cluster to die simultaneously, then when power came
> back, some OSDs didn't (the most common time for a server to fail is at
> startup).
>
> I wonder if I could close this gap with additional monitoring of my own.  I
> could have a cluster bringup protocol that detects OSD processes that aren't
> running after a while and mark those OSDs down.  It would be cleaner, though,
> if I could just find out from the monitor what OSDs are in the map but not
> connected to the monitor cluster.  Is that possible?
>
> A related question: If I mark an OSD down administratively, does it stay down
> until I give a command to mark it back up, or will the monitor detect signs of
> life and declare it up again on its own?
>
> --
> Bryan Henderson                                   San Jose, California
> _______________________________________________
> ceph-users mailing list
> ceph-users@xxxxxxxxxxxxxx
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
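The bring-up check Bryan describes above could be sketched roughly as follows. This is only an illustration under stated assumptions: it assumes the output of `ceph osd dump --format json` contains an "osds" list whose entries carry an "osd" id and an "up" flag (0 or 1), and it leaves open how you learn which OSD processes are actually running (the `running_ids` set is a hypothetical input you would fill in from your own host checks). The function names are made up for this sketch.

```python
import json


def stale_up_osds(osd_dump, running_ids):
    """Return sorted ids of OSDs the map marks up but that are not running.

    osd_dump    -- parsed JSON, assumed to look like `ceph osd dump --format json`
    running_ids -- set of OSD ids whose daemons you have confirmed are running
                   (hypothetical input; gathering it is up to your tooling)
    """
    stale = []
    for entry in osd_dump.get("osds", []):
        if entry.get("up") == 1 and entry["osd"] not in running_ids:
            stale.append(entry["osd"])
    return sorted(stale)


def mark_down_commands(osd_ids):
    """Build one `ceph osd down <id>` argument list per OSD id (not executed here)."""
    return [["ceph", "osd", "down", str(osd_id)] for osd_id in osd_ids]


if __name__ == "__main__":
    # Made-up sample data standing in for the real dump.
    sample = json.loads(
        '{"osds": [{"osd": 0, "up": 1}, {"osd": 1, "up": 1}, {"osd": 2, "up": 0}]}'
    )
    for cmd in mark_down_commands(stale_up_osds(sample, running_ids={0})):
        print(" ".join(cmd))
```

The sketch only builds the commands rather than running them, so the list can be reviewed, or passed to something like `subprocess.run`, once the cluster has had a reasonable time to come up.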