On Tuesday, July 2, 2019, Bryan Henderson <bryanh@xxxxxxxxxxxxxxxx> wrote:
>> Normally, in the case of a restart, somebody who used to have a
>> connection to the OSD would still be running and would flag it as dead.
>> But if *all* the daemons in the cluster lose their soft state, that
>> can't happen.
>
> OK, thanks. I guess that explains it. But that's a pretty serious design
> flaw, isn't it? What I experienced is a pretty common failure mode: a power
> outage caused the entire cluster to die simultaneously, then when power came
> back, some OSDs didn't (the most common time for a server to fail is at
> startup).
>
> I wonder if I could close this gap with additional monitoring of my own. I
> could have a cluster bringup protocol that detects OSD processes that aren't
> running after a while and marks those OSDs down. It would be cleaner, though,
> if I could just find out from the monitor which OSDs are in the map but not
> connected to the monitor cluster. Is that possible?
>
> A related question: If I mark an OSD down administratively, does it stay down
> until I give a command to mark it back up, or will the monitor detect signs of
> life and declare it up again on its own?
>
> --
> Bryan Henderson San Jose, California
> _______________________________________________
> ceph-users mailing list
> ceph-users@xxxxxxxxxxxxxx
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
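The bringup check proposed in the quoted message could be sketched roughly as
follows. This is only an illustration, not a tested procedure. It assumes the
OSD map is read with "ceph osd dump --format json" (whose "osds" entries carry
"osd" and "up" fields) and that the set of OSD ids actually running has been
gathered some other way, for example by checking the ceph-osd services on each
host. The helper names (osds_marked_up, stale_osds, down_commands) are
hypothetical, and the script only builds the "ceph osd down" commands; it does
not run them.

```python
import json


def osds_marked_up(osd_dump_json):
    """Return the ids of OSDs the map considers up.

    Expects the JSON text printed by `ceph osd dump --format json`.
    """
    dump = json.loads(osd_dump_json)
    return {entry["osd"] for entry in dump["osds"] if entry["up"] == 1}


def stale_osds(marked_up, running):
    """OSDs the map says are up but whose daemon was not observed running."""
    return sorted(set(marked_up) - set(running))


def down_commands(stale):
    """Build one `ceph osd down <id>` argument list per stale OSD."""
    return [["ceph", "osd", "down", str(osd_id)] for osd_id in stale]


if __name__ == "__main__":
    # Made-up sample data standing in for real `ceph osd dump` output.
    sample_dump = json.dumps({
        "osds": [
            {"osd": 0, "up": 1, "in": 1},
            {"osd": 1, "up": 1, "in": 1},
            {"osd": 2, "up": 0, "in": 1},
        ]
    })
    # Assumption: only osd.0 was found running after the grace period.
    observed_running = {0}

    stale = stale_osds(osds_marked_up(sample_dump), observed_running)
    for cmd in down_commands(stale):
        # Each list could be passed to subprocess.run(cmd, check=True).
        print(" ".join(cmd))
```

Keeping the comparison separate from command execution makes it possible to
dry-run the check after a power-up before letting it change the OSD map.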