How does monitor know OSD is dead?

Bryan Henderson <bryanh@xxxxxxxxxxxxxxxx> · 28 Jun 2019 00:19:09 +0000

What does it take for a monitor to mark down an OSD that has been dead as a
doornail since the cluster started?

A couple of times, I have seen 'ceph status' report an OSD was up, when it was
quite dead.  Recently, a couple of OSDs were on machines that failed to boot
up after a power failure.  The rest of the Ceph cluster came up, though, and
reported all OSDs up and in.  I/Os stalled, probably because they were waiting
for the dead OSDs to come back.

I waited 15 minutes, because the manual says that if the monitor doesn't hear
a heartbeat from an OSD for that long (the default value of
mon_osd_report_timeout), it marks the OSD down.  But it didn't.  When I issued
"ceph osd down" commands for the dead OSDs, their status changed to down and
I/O started working.
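
To make the expectation concrete, here is a minimal Python sketch of the rule
as I read it in the manual.  It is not Ceph's code and may not match what the
monitor really does.  Only the option name mon_osd_report_timeout and its
15-minute default come from the manual; the function and argument names are
made up for illustration, and using the cluster start time for an OSD that was
never heard from is my assumption.

```python
# Illustrative model only, NOT Ceph source code.
# 900 seconds = the 15-minute default of mon_osd_report_timeout.
MON_OSD_REPORT_TIMEOUT = 900


def should_mark_down(last_heard, now, timeout=MON_OSD_REPORT_TIMEOUT):
    """Return True if the OSD has been silent for longer than the timeout.

    last_heard: time in seconds of the last report from the OSD.  For an
                OSD that never reported, the caller passes the cluster
                start time (an assumption of this sketch).
    now:        current time in seconds, on the same clock.
    """
    return (now - last_heard) > timeout


if __name__ == "__main__":
    cluster_start = 0
    # OSD never heard from; check 10 minutes and 16 minutes after start.
    print(should_mark_down(cluster_start, 10 * 60))
    print(should_mark_down(cluster_start, 16 * 60))
```

Under this reading, an OSD that never came up should have been marked down
once 15 minutes had passed since the cluster started, which is not what I saw.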

And wouldn't even 15 minutes of grace be unacceptable if it means I/Os have to
wait that long before falling back to a redundant OSD?

-- 
Bryan Henderson                                   San Jose, California