Monitor failure after series of traumatic network failures

We have a cuttlefish (0.61.9), 192-OSD cluster that has lost network availability several times since this past Thursday, and whose nodes were all rebooted twice (hastily and inadvisably each time). The final reboot, which was supposed to be "the last thing" before recovery according to our data center team, instead resulted in the failure of all 4 of the cluster's monitors. This happened yesterday afternoon.

[ By the way, we use Ceph to back Cinder and Glance in our OpenStack cloud, block storage only. Also, these network problems were the result of our data center team performing maintenance on our switches that was supposed to be quick and painless. ]

After working all day on various troubleshooting techniques found here and there, we have this situation on our monitor nodes (debug 20):


node-10: dead. ceph-mon will not start

node-14: Seems to have rebuilt its monmap, but the log has stopped updating. Here is the final tail -100: http://pastebin.com/tLiq2ewV

node-16: Same as node-14, with a similar outcome in the log: http://pastebin.com/W87eT7Mw

node-15: ceph-mon starts, but even at debug 20 it will only output this line, over and over again:

       2015-03-18 14:54:35.859511 7f8c82ad3700 -1 asok(0x2e560e0) AdminSocket: request 'mon_status' not defined
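That "request 'mon_status' not defined" error suggests the daemon never got far enough in startup to register its commands on the admin socket. A sketch of how to check what the socket has actually registered, assuming the default admin-socket path (adjust it to your deployment) and run on node-15 itself:

```shell
# Assumed default socket location for mon.node-15; verify against your layout.
SOCK=/var/run/ceph/ceph-mon.node-15.asok

# Ask the socket which commands it has registered. If 'mon_status' is
# missing from the list, the monitor is stuck early in initialization.
if command -v ceph >/dev/null 2>&1; then
    ceph --admin-daemon "$SOCK" help
fi
```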
               
node-02: I added this guy to replace node-10. I updated ceph.conf and pushed it to all the monitor nodes (the OSD nodes without monitors did not get the config push). Since he's a new guy, the log output is obviously different, but again, here are the last 50 lines: http://pastebin.com/pfixdD3d
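Since the monitors may now disagree about the monmap, it could help to see what each surviving monitor believes the monmap to be. A sketch, with the caveats that the --extract-monmap flag may not be present in every cuttlefish-era build, the paths and mon ID here are assumptions, and the monitor should be stopped before extracting:

```shell
# Hypothetical mon ID and output path; substitute your own.
MON_ID=node-14
OUT=/tmp/monmap.$MON_ID

# Dump the monmap from this monitor's local store (stop ceph-mon first,
# and confirm your build supports --extract-monmap).
if command -v ceph-mon >/dev/null 2>&1; then
    ceph-mon -i "$MON_ID" --extract-monmap "$OUT"
fi

# Print the epoch and the monitor addresses the map contains.
if command -v monmaptool >/dev/null 2>&1 && [ -f "$OUT" ]; then
    monmaptool --print "$OUT"
fi
```

Comparing the printed maps across nodes would show whether node-02 ever made it into anyone's monmap, and whether dead node-10 is still listed.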


I run my ceph client from my OpenStack controller. All ceph -s shows me is faults, albeit only against node-15:

2015-03-18 16:47:27.145194 7ff762cff700  0 -- 192.168.241.100:0/15112 >> 192.168.241.115:6789/0 pipe(0x7ff75000cf00 sd=3 :0 s=1 pgs=0 cs=0 l=1).fault
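One way to see whether the client is simply picking an unlucky monitor is to point it at each monitor explicitly with -m. A sketch: 192.168.241.115 is node-15 from the fault line above; the other monitor addresses would come from ceph.conf (placeholders here, not filled in):

```shell
# Probe each monitor directly; a mon that is up but out of quorum will
# still fault, but a different error per node narrows things down.
for MON in 192.168.241.115; do   # append your other monitor IPs here
    echo "--- trying $MON ---"
    if command -v ceph >/dev/null 2>&1; then
        timeout 10 ceph -m "$MON:6789" -s
    fi
done
```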


Finally, here is our ceph.conf: http://pastebin.com/Gmiq2V8S

So that's where we stand. Did we kill our Ceph Cluster (and thus our OpenStack Cloud)? Or is there hope? Any suggestions would be greatly appreciated.


--
\*..+.-
--Greg Chavez
+//..;};
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
