Re: Monitor failure after series of traumatic network failures

This was excellent advice. It should be on some official Ceph troubleshooting page. It takes a while for the monitors to deal with new info, but it works.

Thanks again!
--Greg

On Wed, Mar 18, 2015 at 5:24 PM, Sage Weil <sage@xxxxxxxxxxxx> wrote:
On Wed, 18 Mar 2015, Greg Chavez wrote:
> We have a cuttlefish (0.61.9) 192-OSD cluster that has lost network
> availability several times since this past Thursday and whose nodes were all
> rebooted twice (hastily and inadvisably each time). The final reboot, which
> was supposed to be "the last thing" before recovery according to our data
> center team, resulted in a failure of the cluster's 4 monitors. This
> happened yesterday afternoon.
>
> [ By the way, we use Ceph to back Cinder and Glance in our OpenStack Cloud,
> block storage only; also, these network problems were the result of our data
> center team executing maintenance on our switches that was supposed to be
> quick and painless ]
>
> After working all day on various troubleshooting techniques found here and
> there, we have this situation on our monitor nodes (debug 20):
>
>
> node-10: dead. ceph-mon will not start
>
> node-14: Seemed to rebuild its monmap. The log has stopped reporting with
> this final tail -100: http://pastebin.com/tLiq2ewV
>
> node-16: Same as 14, similar outcome in the
> log: http://pastebin.com/W87eT7Mw
>
> node-15: ceph-mon starts but even at debug 20, it will only output this line,
> over and over again:
>
>        2015-03-18 14:54:35.859511 7f8c82ad3700 -1 asok(0x2e560e0)
> AdminSocket: request 'mon_status' not defined
>                
> node-02: I added this guy to replace node-10. I updated ceph.conf and pushed
> it to all the monitor nodes (the osd nodes without monitors did not get the
> config push). Since he's a new guy the log output is obviously different, but
> again, here are the last 50 lines: http://pastebin.com/pfixdD3d
>
>
> I run my ceph client from my OpenStack controller. All ceph -s shows me is
> faults, albeit only against node-15:
>
> 2015-03-18 16:47:27.145194 7ff762cff700  0 -- 192.168.241.100:0/15112 >>
> 192.168.241.115:6789/0 pipe(0x7ff75000cf00 sd=3 :0 s=1 pgs=0 cs=0 l=1).fault
>
>
> Finally, here is our ceph.conf: http://pastebin.com/Gmiq2V8S
>
> So that's where we stand. Did we kill our Ceph Cluster (and thus our
> OpenStack Cloud)?

Unlikely!  You have 5 copies, and I doubt all of them are unrecoverable.

> Or is there hope? Any suggestions would be greatly
> appreciated.

Stop all mons.

Make a backup copy of each mon data dir.

Copy the node-14 data dir over the node-15 and/or node-10 and/or node-02
data dirs.

Start all mons, see if they form a quorum.
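
For concreteness, a rough sketch of those steps -- assuming the default mon
data path /var/lib/ceph/mon/ceph-<name>, sysvinit-style service scripts
(upstart deployments use different commands), and mon names matching the
hostnames in this thread; all of these may differ in your ceph.conf:

    # on every mon host: stop the monitor daemon(s)
    service ceph stop mon

    # back up each mon data dir before touching anything
    cp -a /var/lib/ceph/mon/ceph-node-15 /root/mon-node-15.bak

    # copy the good store (node-14) over a broken one, e.g. node-15
    # (run on node-15; needs ssh access to node-14)
    rsync -a --delete node-14:/var/lib/ceph/mon/ceph-node-14/ \
        /var/lib/ceph/mon/ceph-node-15/

    # start the monitors again and check whether they form a quorum
    service ceph start mon
    ceph -s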

Once things are working again, at the *very* least upgrade to dumpling,
and preferably then upgrade to firefly!!  Cuttlefish was EOL more than a
year ago, and dumpling will be EOL in a couple of months.
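
A rough sketch of the usual rolling-upgrade order -- assuming Debian/Ubuntu
hosts pointed at the per-release Ceph apt repositories; exact repo URLs and
service commands depend on the distro and how the cluster was deployed:

    # on each node, switch apt to the dumpling (0.67.x) repo, then:
    apt-get update && apt-get install ceph ceph-common

    # restart monitors first, one host at a time, waiting for quorum
    service ceph restart mon

    # then restart OSDs, one host at a time, waiting for HEALTH_OK
    service ceph restart osd

    # once the whole cluster is on dumpling, repeat for firefly (0.80.x)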

sage

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
