> Normally in the case of a restart then somebody who used to have a
> connection to the OSD would still be running and flag it as dead. But
> if *all* the daemons in the cluster lose their soft state, that can't
> happen.

OK, thanks. I guess that explains it. But that's a pretty serious design flaw, isn't it? What I experienced is a pretty common failure mode: a power outage caused the entire cluster to die simultaneously, and when power came back, some OSDs didn't come back with it (the most common time for a server to fail is at startup).

I wonder if I could close this gap with additional monitoring of my own. I could have a cluster bringup protocol that detects OSD processes that still aren't running after a while and marks those OSDs down. It would be cleaner, though, if I could just find out from the monitor which OSDs are in the map but not connected to the monitor cluster. Is that possible?

A related question: if I mark an OSD down administratively, does it stay down until I give a command to mark it back up, or will the monitor detect signs of life and declare it up again on its own?

-- 
Bryan Henderson                                   San Jose, California
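
P.S. To make the bringup check concrete, here is a rough sketch of what I have in mind, not something I have tested against a real cluster. It assumes that `ceph osd dump --format json` returns an "osds" list whose entries carry "osd" (the ID) and "up" (1 or 0), and that `ceph osd down <id>` is the way to mark an OSD down. The function names and the `running_ids` set, which would come from my own process monitoring, are hypothetical.

```python
import json
import subprocess


def stale_up_osds(osd_dump, running_ids):
    """Return sorted IDs of OSDs the map shows as up but that are not running.

    osd_dump    -- parsed JSON of the OSD map (assumed layout: {"osds": [...]})
    running_ids -- set of OSD IDs whose processes my own monitoring found alive
    """
    return sorted(
        entry["osd"]
        for entry in osd_dump.get("osds", [])
        if entry.get("up") == 1 and entry["osd"] not in running_ids
    )


def fetch_osd_dump():
    """Ask the monitors for the current OSD map as JSON."""
    out = subprocess.check_output(["ceph", "osd", "dump", "--format", "json"])
    return json.loads(out)


def mark_down(osd_ids):
    """Mark each given OSD down in the map."""
    for osd_id in osd_ids:
        subprocess.check_call(["ceph", "osd", "down", str(osd_id)])


def bringup_check(running_ids):
    """Mark down every OSD that is up in the map but has no running process."""
    stale = stale_up_osds(fetch_osd_dump(), running_ids)
    mark_down(stale)
    return stale
```

I would call `bringup_check({0, 1, 3})` some minutes after power-on, passing the IDs of the OSD processes that actually started. Whether the OSDs then stay down is exactly the related question above.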