On Mon, Jul 1, 2019 at 8:56 PM Bryan Henderson <bryanh@xxxxxxxxxxxxxxxx> wrote:
>
> > Normally in the case of a restart, somebody who used to have a
> > connection to the OSD would still be running and flag it as dead. But
> > if *all* the daemons in the cluster lose their soft state, that can't
> > happen.
>
> OK, thanks. I guess that explains it. But that's a pretty serious design
> flaw, isn't it? What I experienced is a pretty common failure mode: a power
> outage caused the entire cluster to die simultaneously, then when power came
> back, some OSDs didn't (the most common time for a server to fail is at
> startup).

I am a little surprised; the peer OSDs used to detect this. But we've
re-done the heartbeat logic a few times, and the combination of losing a
whole data center's worth of daemons and not having monitoring to check
whether they come back up actually isn't that common. Can you create a
tracker ticket with the version you're seeing it on and any non-default
configuration options you've set?
-Greg

>
> I wonder if I could close this gap with additional monitoring of my own. I
> could have a cluster bringup protocol that detects OSD processes that aren't
> running after a while and marks those OSDs down. It would be cleaner, though,
> if I could just find out from the monitor which OSDs are in the map but not
> connected to the monitor cluster. Is that possible?
>
> A related question: if I mark an OSD down administratively, does it stay down
> until I give a command to mark it back up, or will the monitor detect signs of
> life and declare it up again on its own?
>
> --
> Bryan Henderson                                   San Jose, California
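
A minimal sketch of the bringup check described above, for illustration only.
It assumes a site-specific setup: OSDs managed as systemd "ceph-osd@<id>"
units, the local OSD ids listed by hand, and the admin "ceph" CLI available
on the host. After a grace period it marks any OSD whose daemon never started
as down.

#!/usr/bin/env python3
# Sketch of a post-boot check: after a grace period, mark any locally
# hosted OSD whose daemon never started as down, so the monitors do not
# keep it "up" in the map after a whole-cluster power loss.
#
# Assumptions (site-specific): OSDs are systemd-managed as ceph-osd@<id>
# units, OSD_IDS lists the OSDs expected on this host, and the "ceph"
# CLI is available with sufficient privileges.
import subprocess
import time

OSD_IDS = [0, 3, 7]      # OSDs expected to run on this host (example values)
GRACE_SECONDS = 300      # how long to wait after boot before checking

def osd_daemon_running(osd_id):
    """Return True if the ceph-osd@<id> systemd unit is active."""
    result = subprocess.run(
        ["systemctl", "is-active", "--quiet", "ceph-osd@%d" % osd_id])
    return result.returncode == 0

def main():
    time.sleep(GRACE_SECONDS)   # give OSDs a chance to start normally
    for osd_id in OSD_IDS:
        if not osd_daemon_running(osd_id):
            # Tell the monitors this OSD is down. Marking an OSD down is
            # not sticky: if the daemon later starts and boots, the
            # monitors will mark it up again on their own.
            subprocess.run(["ceph", "osd", "down", str(osd_id)], check=True)

if __name__ == "__main__":
    main()

To compare what the map believes against what is actually running, the output
of "ceph osd dump --format json" or "ceph osd tree" can be checked per host.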