Re: cluster not coming up after reboot

Craig Lewis <clewis@xxxxxxxxxxxxxxxxxx> · Thu, 23 Apr 2015 09:58:13 -0700

On Thu, Apr 23, 2015 at 5:20 AM, Kenneth Waegeman wrote:
So it is all fixed now, but is it explainable that at first about 90% of the OSDs kept going into shutdown over and over, and only after some time reached a stable situation, because of one host's network failure?

Thanks again!

Yes, unless you've adjusted:
[global]
  mon osd min down reporters = 9
  mon osd min down reports = 12

OSDs talk to the MONs on the public network.  The cluster network is only used for OSD to OSD communication.

If one OSD node can't talk on the cluster network, the other nodes will tell the MONs that its OSDs are down.  That node will also tell the MONs that all the other OSDs are down.  Then the OSDs marked down will tell the MONs that they're not down, and the cycle will repeat.

I'm somewhat surprised that your cluster eventually stabilized.

I have 8 OSDs per node.  I set my min down reporters high enough that no single node can mark another node's OSDs down.
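
A minimal sketch of that sizing rule, in Python. The function names are hypothetical and are not part of Ceph. It assumes each reporting OSD counts as one reporter, so the threshold only needs to exceed the number of OSDs on the largest node.

```python
def min_down_reporters(osds_per_node: int) -> int:
    """Smallest reporter threshold that one node's OSDs cannot reach alone.

    Assumption: every OSD on the node counts as one distinct reporter.
    """
    if osds_per_node < 1:
        raise ValueError("osds_per_node must be at least 1")
    return osds_per_node + 1


def single_node_can_mark_down(osds_per_node: int, reporters_required: int) -> bool:
    """True if the OSDs of a single node are enough to meet the threshold."""
    return osds_per_node >= reporters_required


if __name__ == "__main__":
    # 8 OSDs per node, as in the setup described above.
    print(min_down_reporters(8))
    print(single_node_can_mark_down(8, 9))
```

With 8 OSDs per node this gives 9, which matches the "mon osd min down reporters = 9" setting shown earlier. On a cluster with uneven nodes, pass the OSD count of the largest node.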

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com