Re: cluster not coming up after reboot

Kenneth Waegeman <kenneth.waegeman@xxxxxxxx> · Mon, 27 Apr 2015 10:45:45 +0200

On 04/23/2015 06:58 PM, Craig Lewis wrote:
Yes, unless you've adjusted:
[global]
   mon osd min down reporters = 9
   mon osd min down reports = 12

OSDs talk to the MONs on the public network.  The cluster network is
only used for OSD to OSD communication.

If one OSD node can't talk on that network, the other nodes will tell
the MONs that its OSDs are down.  And that node will also tell the MONs
that all the other OSDs are down.  Then the OSDs marked down will tell
the MONs that they're not down, and the cycle will repeat.
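
The cycle described above can be illustrated with a toy model. This is only a sketch under stated assumptions, not Ceph's actual monitor logic: the function name `osds_marked_down`, the three-node layout, and the idea that every OSD reports every OSD it cannot reach are all hypothetical simplifications. It shows why raising the two thresholds keeps one isolated node from getting everyone else marked down.

```python
from collections import defaultdict

def osds_marked_down(reports, min_reporters, min_reports):
    """Toy model: an OSD is marked down only if enough distinct reporters
    AND enough total reports accuse it. `reports` is a list of
    (reporter_osd, target_osd) pairs. Hypothetical helper, not Ceph code."""
    reporters = defaultdict(set)   # target -> set of distinct reporters
    counts = defaultdict(int)      # target -> total number of reports
    for reporter, target in reports:
        reporters[target].add(reporter)
        counts[target] += 1
    return sorted(
        t for t in counts
        if len(reporters[t]) >= min_reporters and counts[t] >= min_reports
    )

# Assumed layout: 3 nodes with 8 OSDs each. OSDs 0-7 sit on the node that
# lost the cluster network; OSDs 8-23 are on the two healthy nodes.
isolated = range(0, 8)
others = range(8, 24)

# Each side reports every OSD on the other side as unreachable.
reports = [(r, t) for r in isolated for t in others]
reports += [(r, t) for r in others for t in isolated]

# Low thresholds: the isolated node alone can get all other OSDs marked down.
low = osds_marked_down(reports, min_reporters=1, min_reports=3)

# Thresholds from the config above: only the isolated node's OSDs qualify.
high = osds_marked_down(reports, min_reporters=9, min_reports=12)
```

In this model `low` contains all 24 OSDs, while `high` contains only OSDs 0-7, because no healthy OSD is accused by more than the 8 OSDs of the isolated node.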

Thanks for the explanation, that makes sense now! Good to know I should 
set those values:)

I'm somewhat surprised that your cluster eventually stabilized.

The OSDs of that one node were eventually set 'out' of the cluster. I
guess the OSDs were down long enough to get marked out? (Or the
monitors took some action after too many failures?) And then the other
OSDs could stay up, I guess :)

I have 8 OSDs per node.  I set my min down reporters high enough that no
single node can mark another node's OSDs down.
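
The rule of thumb above amounts to picking a reporter threshold one larger than the biggest node. A minimal sketch, assuming a hypothetical helper name and a plain list of per-node OSD counts as input:

```python
def min_down_reporters(osds_per_node):
    """Smallest reporter threshold that no single node can reach on its own.
    `osds_per_node` is a list of OSD counts, one entry per node.
    Hypothetical helper for illustration, not part of Ceph."""
    return max(osds_per_node) + 1

# With 8 OSDs on every node, the threshold comes out as 9, which matches
# the "mon osd min down reporters = 9" setting quoted earlier.
threshold = min_down_reporters([8, 8, 8])
```

This only covers the reporter count. The thread gives no formula for the "mon osd min down reports" value, so none is derived here.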
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com