Re: How does monitor know OSD is dead?

Bryan Henderson <bryanh@xxxxxxxxxxxxxxxx> · 3 Jul 2019 18:22:34 +0000

> I'm a bit confused about what happened here, though: that 600 second 
> interval is only important if *every* OSD in the system is down.  If you 
> reboot the data center, why didn't *any* OSD daemons start?  (And even if 
> none did, having the ceph -s report all OSDs down instead of up isn't 
> going to change anything except whether your pager is going off, right?)

I think you got lost in the thread of discussion.  Enough OSDs for the cluster
to be fully functional _did_ come back.  But the cluster kept sending some
I/O to the dead ones (which it claimed all the while were up), even after
running that way for 20 minutes, so the cluster was not functional.  The
600 second "mon osd down out interval" was a red herring.

It might be relevant that there was a grand total of three OSDs in the map.
One came up; two did not.  All objects were replicated across all three, with
the hope that this sort of thing would not be fatal.  It's a Jewel system with
that version's default of 1 for "mon osd min down reporters".
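
To make the reporter threshold concrete, here is a toy model of the rule as
the setting's name suggests it: an OSD is marked down once at least
"mon osd min down reporters" distinct peers have reported it failed.  This is
a sketch under that assumption, not Ceph's monitor code; the function name
and the report format are hypothetical.

```python
# Toy model of the "mon osd min down reporters" threshold; not Ceph source code.
# The names should_mark_down and reports are hypothetical, for illustration only.

def should_mark_down(target, failure_reports, min_down_reporters=1):
    """Return True if enough distinct OSDs have reported `target` as failed.

    failure_reports: iterable of (reporter_osd, reported_osd) pairs.
    """
    # Count distinct reporters; an OSD cannot report itself.
    reporters = {src for src, dst in failure_reports
                 if dst == target and src != target}
    return len(reporters) >= min_down_reporters


# Three OSDs in the map; osd.0 came back, osd.1 and osd.2 did not.
reports = [(0, 1), (0, 2)]  # the lone survivor reports both of its peers
down = [osd for osd in (0, 1, 2) if should_mark_down(osd, reports)]
print(down)  # [1, 2]
```

Under this simplified model, reports from the single surviving OSD would be
enough to get the two dead ones marked down when the threshold is 1, but not
when it is 2.  The sketch says nothing about why the real cluster kept
showing the dead OSDs as up.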

-- 
Bryan Henderson                                   San Jose, California
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com