On Wed, 18 Mar 2015, Greg Chavez wrote:
> We have a cuttlefish (0.61.9) 192-OSD cluster that has lost network
> availability several times since this past Thursday and whose nodes were all
> rebooted twice (hastily and inadvisably each time). The final reboot, which
> was supposed to be "the last thing" before recovery according to our data
> center team, resulted in a failure of the cluster's 4 monitors. This
> happened yesterday afternoon.
>
> [ By the way, we use Ceph to back Cinder and Glance in our OpenStack Cloud,
> block storage only; also, these network problems were the result of our data
> center team executing maintenance on our switches that was supposed to be
> quick and painless ]
>
> After working all day on various troubleshooting techniques found here and
> there, we have this situation on our monitor nodes (debug 20):
>
>
> node-10: dead. ceph-mon will not start
>
> node-14: Seemed to rebuild its monmap. The log has stopped updating; here
> are the final 100 lines (tail -100): http://pastebin.com/tLiq2ewV
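As a sanity check, you can dump the monmap a mon actually has in its
store and inspect it (assuming your ceph-mon build supports
--extract-monmap; the -i id and paths below are guesses, adjust for
your deployment):

    # with the mon stopped, extract its current monmap to a file
    ceph-mon -i node-14 --extract-monmap /tmp/node-14.monmap

    # print the map epoch and the mon addresses it contains
    monmaptool --print /tmp/node-14.monmap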
>
> node-16: Same as node-14, with a similar outcome in the
> log: http://pastebin.com/W87eT7Mw
>
> node-15: ceph-mon starts, but even at debug 20 it will only output this
> line, over and over again:
>
> 2015-03-18 14:54:35.859511 7f8c82ad3700 -1 asok(0x2e560e0)
> AdminSocket: request 'mon_status' not defined
>
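FWIW, that message just means 'mon_status' has not been registered on the
admin socket yet, which usually indicates the mon is wedged very early in
startup. On a healthy mon the same query answers immediately (the .asok
path below is the default location; adjust if yours differs):

    # query a running mon directly over its admin socket
    ceph --admin-daemon /var/run/ceph/ceph-mon.node-15.asok mon_status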
> node-02: I added this guy to replace node-10. I updated ceph.conf and pushed
> it to all the monitor nodes (the osd nodes without monitors did not get the
> config push). Since it's a new node, the log output is obviously different, but
> again, here are the last 50 lines: http://pastebin.com/pfixdD3d
>
>
> I run my ceph client from my OpenStack controller. All 'ceph -s' shows me
> is faults, albeit only against node-15:
>
> 2015-03-18 16:47:27.145194 7ff762cff700 0 -- 192.168.241.100:0/15112 >>
> 192.168.241.115:6789/0 pipe(0x7ff75000cf00 sd=3 :0 s=1 pgs=0 cs=0 l=1).fault
>
>
> Finally, here is our ceph.conf: http://pastebin.com/Gmiq2V8S
>
> So that's where we stand. Did we kill our Ceph Cluster (and thus our
> OpenStack Cloud)?
Unlikely! You have 5 copies, and I doubt all of them are unrecoverable.
> Or is there hope? Any suggestions would be greatly
> appreciated.
Stop all mons.
Make a backup copy of each mon data dir.
Copy the node-14 data dir over those of node-15 and/or node-10 and/or
node-02 (see the sketch below).
Start all mons, see if they form a quorum.
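In shell terms, roughly this (a sketch only; it assumes sysvinit-style
init scripts and the default mon data dirs under /var/lib/ceph/mon, so
substitute your actual paths and hostnames):

    # on every mon node: stop the monitor
    service ceph stop mon

    # back up each mon's data dir before touching anything
    cp -a /var/lib/ceph/mon/ceph-node-15 /var/lib/ceph/mon/ceph-node-15.bak

    # overwrite node-15's store with a copy of node-14's
    # (repeat for node-10 and/or node-02)
    rsync -a --delete node-14:/var/lib/ceph/mon/ceph-node-14/ \
        /var/lib/ceph/mon/ceph-node-15/

    # bring the mons back up and watch for quorum
    service ceph start mon
    ceph -s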
Once things are working again, at the *very* least upgrade to dumpling,
and preferably then on to firefly!! Cuttlefish went EOL more than a
year ago, and dumpling goes EOL in a couple of months.
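If you are on Ubuntu with packages from ceph.com, the upgrade itself is
roughly this (a sketch; the repo file name and suite are assumptions
about your setup):

    # point apt at the dumpling repo instead of cuttlefish
    sed -i 's/debian-cuttlefish/debian-dumpling/' /etc/apt/sources.list.d/ceph.list
    apt-get update && apt-get -y upgrade

    # restart daemons mons-first, then OSDs, one node at a time
    service ceph restart mon
    service ceph restart osd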
sage