Re: Monitor failure after series of traumatic network failures

This was excellent advice. It should be on some official Ceph troubleshooting page. It takes a while for the monitors to deal with new info, but it works.

Thanks again!
--Greg

On Wed, Mar 18, 2015 at 5:24 PM, Sage Weil <sage@xxxxxxxxxxxx> wrote:
On Wed, 18 Mar 2015, Greg Chavez wrote:
> We have a cuttlefish (0.61.9) 192-OSD cluster that has lost network
> availability several times since this past Thursday and whose nodes were all
> rebooted twice (hastily and inadvisably each time). The final reboot, which
> was supposed to be "the last thing" before recovery according to our data
> center team, resulted in a failure of the cluster's 4 monitors. This
> happened yesterday afternoon.
>
> [ By the way, we use Ceph to back Cinder and Glance in our OpenStack Cloud,
> block storage only; also, these network problems were the result of our data
> center team executing maintenance on our switches that was supposed to be
> quick and painless ]
>
> After working all day on various troubleshooting techniques found here and
> there, we have this situation on our monitor nodes (debug 20):
>
>
> node-10: dead. ceph-mon will not start
>
> node-14: Seemed to rebuild its monmap. The log has stopped updating; here is
> the final tail -100: http://pastebin.com/tLiq2ewV
>
> node-16: Same as 14, similar outcome in the
> log: http://pastebin.com/W87eT7Mw
>
> node-15: ceph-mon starts but even at debug 20, it will only output this line,
> over and over again:
>
>        2015-03-18 14:54:35.859511 7f8c82ad3700 -1 asok(0x2e560e0)
> AdminSocket: request 'mon_status' not defined
>                
> node-02: I added this guy to replace node-10. I updated ceph.conf and pushed
> it to all the monitor nodes (the osd nodes without monitors did not get the
> config push). Since he's a new guy the log output is obviously different, but
> again, here are the last 50 lines: http://pastebin.com/pfixdD3d
>
>
> I run my ceph client from my OpenStack controller. All ceph -s shows me is
> faults, albeit only to node-15:
>
> 2015-03-18 16:47:27.145194 7ff762cff700  0 -- 192.168.241.100:0/15112 >>
> 192.168.241.115:6789/0 pipe(0x7ff75000cf00 sd=3 :0 s=1 pgs=0 cs=0 l=1).fault
>
>
> Finally, here is our ceph.conf: http://pastebin.com/Gmiq2V8S
>
> So that's where we stand. Did we kill our Ceph Cluster (and thus our
> OpenStack Cloud)?

Unlikely!  You have 5 copies, and I doubt all of them are unrecoverable.

> Or is there hope? Any suggestions would be greatly
> appreciated.

Stop all mons.

Make a backup copy of each mon data dir.
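For example, assuming a default install where the mon data dirs live under
/var/lib/ceph/mon/ceph-<mon-id>, the mon id matches the hostname, and the
sysvinit helper is in use (all assumptions; adjust paths and init commands
for your layout), stopping and backing up on each monitor node might look
like:

    # on each monitor node: stop the local mon, then snapshot its data dir
    service ceph stop mon
    cp -a /var/lib/ceph/mon/ceph-$(hostname) /root/mon-backup-$(hostname)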

Copy the node-14 data dir over the node-15 and/or node-10 and/or node-02
data dirs.
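A rough sketch of that copy, again assuming the default
/var/lib/ceph/mon/ceph-<id> layout and that node-14 can reach the other
nodes over ssh (hostnames and paths are illustrative):

    # push node-14's mon store over node-15's, replacing its contents;
    # repeat for node-10 and/or node-02 as needed
    rsync -a --delete /var/lib/ceph/mon/ceph-node-14/ node-15:/var/lib/ceph/mon/ceph-node-15/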

Start all mons, see if they form a quorum.
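Something like the following, with the same caveats about init scripts,
should bring them back up and show whether a quorum forms:

    # on each monitor node
    service ceph start mon
    # then, from any client with a working ceph.conf and keyring
    ceph quorum_status
    ceph -s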

Once things are working again, at the *very* least upgrade to dumpling,
and preferably then upgrade to firefly!!  Cuttlefish was EOL more than a
year ago, and dumpling is EOL in a couple months.

sage

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
