What does it take for a monitor to consider an OSD down when that OSD has been dead as a doornail since the cluster started?

A couple of times, I have seen 'ceph status' report an OSD as up when it was quite dead. Recently, a couple of OSDs were on machines that failed to boot after a power failure. The rest of the Ceph cluster came up, though, and reported all OSDs up and in. I/Os stalled, probably because they were waiting for the dead OSDs to come back.

I waited 15 minutes, because the manual says that if the monitor doesn't hear a heartbeat from an OSD in that long (the default value of mon_osd_report_timeout), it marks the OSD down. But it didn't. I issued "osd down" commands for the dead OSDs; the status then changed to down and I/O started working.

And wouldn't even 15 minutes of grace be unacceptable if it means I/Os have to wait that long before falling back to a redundant OSD?

--
Bryan Henderson
San Jose, California
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com