Re: Help! 61.1 killed my monitors in prod

On 05/10/2013 11:02 PM, Jeppesen, Nelson wrote:
After upgrading my cluster everything looked good, then I rebooted the
farm and all hell broke loose.

I have 3 monitors, but none of them are able to start. On all of them the
'/usr/bin/python /usr/sbin/ceph-create-keys' command is hanging because
the monitors cannot form a quorum.

We would certainly be interested in taking a look at the logs from those monitors, and would appreciate it if you could set 'debug mon = 20', 'debug auth = 10' and 'debug ms = 1', and run the monitors until you hit the issue again.
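
For reference, a minimal sketch of how those three settings might look as a ceph.conf fragment. The setting names and values are the ones quoted above; placing them under a [mon] section and the helper name mon_debug_conf are assumptions made for illustration. The function only prints the fragment so it can be reviewed before being merged into /etc/ceph/ceph.conf by hand.

```shell
# Print a ceph.conf fragment with the debug levels requested above.
# Assumption: the settings go under [mon]; mon_debug_conf is a
# hypothetical helper name, not part of Ceph.
mon_debug_conf() {
    cat <<'EOF'
[mon]
    debug mon = 20
    debug auth = 10
    debug ms = 1
EOF
}

# Show the fragment; nothing is written to disk.
mon_debug_conf
```

After merging the fragment, restarting the monitor in the foreground (as in the 'ceph-mon -i a ... -d' invocation further down) should produce the more verbose log.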


All ceph tools are producing the following fault:

# ceph -w

2013-05-10 15:00:55.259382 7f6b68e0e700  0 -- :/20337 >>
10.1.1.21:6789/0 pipe(0x2fdc520 sd=4 :0 s=1 pgs=0 cs=0 l=1).fault

….

Using monmaptool I removed all but one monitor, did the same in
ceph.conf, then tried running the monitor interactively and got the following:

Did you inject the monmap? It seems as if the monitor is still attempting to probe for the remaining monitors in the monmap, so that would be an indicator that although you changed the monmap, the monitor still sees the older map (which means the newer map wasn't injected).

Just in case, you can inject the monmap by running 'ceph-mon -i a --inject-monmap <monmap.file>'. You must shut down the monitor before injecting the monmap.
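
A rough sketch of the whole sequence, under several assumptions: the monitors to drop are named b and c (the thread only shows mon.a and two peer addresses), the monitor is stopped through the sysvinit 'service ceph' script, and /tmp/monmap is a scratch path. The helper names run and shrink_monmap are hypothetical. With DRY_RUN set, the commands are only printed, which is a safer first pass on a production cluster.

```shell
# Print each command; execute it only when DRY_RUN is unset or empty.
run() {
    echo "+ $*"
    if [ -z "${DRY_RUN:-}" ]; then
        "$@"
    fi
}

# usage: shrink_monmap MON_ID MONMAP_FILE MON_TO_REMOVE...
# Stops the monitor, extracts its current monmap, removes the named
# monitors, prints the result, and injects the edited map back.
shrink_monmap() {
    mon_id=$1
    monmap=$2
    shift 2
    # Assumption: sysvinit-style service script; adjust for your init system.
    run service ceph stop "mon.$mon_id" || return 1
    run ceph-mon -i "$mon_id" --extract-monmap "$monmap" || return 1
    for name in "$@"; do
        run monmaptool "$monmap" --rm "$name" || return 1
    done
    run monmaptool --print "$monmap" || return 1
    run ceph-mon -i "$mon_id" --inject-monmap "$monmap" || return 1
}

# Example (prints the commands without running them):
#   DRY_RUN=1; shrink_monmap a /tmp/monmap b c
```

Once the map is injected, starting mon.a again should show it probing only the monitors left in the map.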


  -Joao


Here's the mon output:

# /usr/bin/ceph-mon -i a --pid-file /var/run/ceph/mon.a.pid -c
/etc/ceph/ceph.conf  -d

2013-05-10 14:54:23.405324 7f0750a61780  0 ceph version 0.61
(237f3f1e8d8c3b85666529860285dcdffdeda4c5), process ceph-mon, pid 29289

starting mon.a rank 0 at 10.1.1.21:6789/0 mon_data
/var/lib/ceph/mon/ceph-a fsid 969f28c3-5ee1-4451-9b5b-97c52b724a06

2013-05-10 14:54:23.455975 7f0750a61780  1 mon.a@-1(probing) e1 preinit
fsid 969f28c3-5ee1-4451-9b5b-97c52b724a06

2013-05-10 14:54:23.820160 7f0750a61780  1 mon.a@-1(probing).osd e6666
e6666: 96 osds: 96 up, 96 in

2013-05-10 14:54:23.820372 7f0750a61780  1 mon.a@-1(probing).osd e6667
e6667: 96 osds: 96 up, 96 in

2013-05-10 14:54:23.820618 7f0750a61780  1 mon.a@-1(probing).osd e6668
e6668: 96 osds: 96 up, 96 in

2013-05-10 14:54:23.820802 7f0750a61780  1 mon.a@-1(probing).osd e6669
e6669: 96 osds: 96 up, 96 in

2013-05-10 14:54:23.820995 7f0750a61780  1 mon.a@-1(probing).osd e6670
e6670: 96 osds: 96 up, 96 in

2013-05-10 14:54:23.821180 7f0750a61780  1 mon.a@-1(probing).osd e6671
e6671: 96 osds: 96 up, 96 in

2013-05-10 14:54:23.821368 7f0750a61780  1 mon.a@-1(probing).osd e6672
e6672: 96 osds: 96 up, 96 in

2013-05-10 14:54:23.821549 7f0750a61780  1 mon.a@-1(probing).osd e6673
e6673: 96 osds: 96 up, 96 in

2013-05-10 14:54:23.821735 7f0750a61780  1 mon.a@-1(probing).osd e6674
e6674: 96 osds: 96 up, 96 in

2013-05-10 14:54:23.821981 7f0750a61780  1 mon.a@-1(probing).osd e6675
e6675: 96 osds: 96 up, 96 in

2013-05-10 14:54:23.822173 7f0750a61780  1 mon.a@-1(probing).osd e6676
e6676: 96 osds: 96 up, 96 in

2013-05-10 14:54:23.822353 7f0750a61780  1 mon.a@-1(probing).osd e6677
e6677: 96 osds: 96 up, 96 in

2013-05-10 14:54:23.822529 7f0750a61780  1 mon.a@-1(probing).osd e6678
e6678: 96 osds: 96 up, 96 in

2013-05-10 14:54:23.822698 7f0750a61780  1 mon.a@-1(probing).osd e6679
e6679: 96 osds: 96 up, 96 in

2013-05-10 14:54:23.822879 7f0750a61780  1 mon.a@-1(probing).osd e6680
e6680: 96 osds: 96 up, 96 in

2013-05-10 14:54:23.823056 7f0750a61780  1 mon.a@-1(probing).osd e6681
e6681: 96 osds: 96 up, 96 in

2013-05-10 14:54:23.823229 7f0750a61780  1 mon.a@-1(probing).osd e6682
e6682: 96 osds: 96 up, 96 in

2013-05-10 14:54:23.823403 7f0750a61780  1 mon.a@-1(probing).osd e6683
e6683: 96 osds: 96 up, 96 in

2013-05-10 14:54:23.823580 7f0750a61780  1 mon.a@-1(probing).osd e6684
e6684: 96 osds: 96 up, 96 in

2013-05-10 14:54:23.823749 7f0750a61780  1 mon.a@-1(probing).osd e6685
e6685: 96 osds: 96 up, 96 in

2013-05-10 14:54:23.823915 7f0750a61780  1 mon.a@-1(probing).osd e6686
e6686: 96 osds: 92 up, 96 in

2013-05-10 14:54:23.824088 7f0750a61780  1 mon.a@-1(probing).osd e6687
e6687: 96 osds: 88 up, 96 in

2013-05-10 14:54:23.824261 7f0750a61780  1 mon.a@-1(probing).osd e6688
e6688: 96 osds: 83 up, 96 in

2013-05-10 14:54:23.824434 7f0750a61780  1 mon.a@-1(probing).osd e6689
e6689: 96 osds: 71 up, 96 in

2013-05-10 14:54:23.824610 7f0750a61780  1 mon.a@-1(probing).osd e6690
e6690: 96 osds: 69 up, 96 in

2013-05-10 14:54:23.824793 7f0750a61780  1 mon.a@-1(probing).osd e6691
e6691: 96 osds: 56 up, 96 in

2013-05-10 14:54:23.838611 7f0750a61780  0 mon.a@-1(probing).osd e6691
crush map has features 33816576, adjusting msgr requires

2013-05-10 14:54:23.838630 7f0750a61780  0 mon.a@-1(probing).osd e6691
crush map has features 33816576, adjusting msgr requires

2013-05-10 14:54:23.838634 7f0750a61780  0 mon.a@-1(probing).osd e6691
crush map has features 33816576, adjusting msgr requires

2013-05-10 14:54:23.838636 7f0750a61780  0 mon.a@-1(probing).osd e6691
crush map has features 33816576, adjusting msgr requires

2013-05-10 14:54:23.841335 7f0750a61780  0 mon.a@-1(probing) e1  my rank
is now 0 (was -1)

2013-05-10 14:54:23.842481 7f0748ff9700  0 -- 10.1.1.21:6789/0 >>
10.1.1.33:6789/0 pipe(0x204ba00 sd=41 :0 s=1 pgs=0 cs=0 l=0).fault

2013-05-10 14:54:23.842493 7f07490fa700  0 -- 10.1.1.21:6789/0 >>
10.1.1.22:6789/0 pipe(0x204bc80 sd=40 :0 s=1 pgs=0 cs=0 l=0).fault

2013-05-10 14:54:28.841438 7f074aaff700  1 mon.a@0(probing) e1
discarding message auth(proto 0 27 bytes epoch 1) v1 and sending client
elsewhere

2013-05-10 14:54:28.841472 7f074aaff700  1 mon.a@0(probing) e1
discarding message auth(proto 0 27 bytes epoch 1) v1 and sending client
elsewhere

2013-05-10 14:54:28.841483 7f074aaff700  1 mon.a@0(probing) e1
discarding message auth(proto 0 30 bytes epoch 1) v1 and sending client
elsewhere

2013-05-10 14:54:28.841491 7f074aaff700  1 mon.a@0(probing) e1
discarding message auth(proto 0 27 bytes epoch 1) v1 and sending client
elsewhere

2013-05-10 14:54:28.841499 7f074aaff700  1 mon.a@0(probing) e1
discarding message auth(proto 0 27 bytes epoch 1) v1 and sending client
elsewhere

2013-05-10 14:54:28.841507 7f074aaff700  1 mon.a@0(probing) e1
discarding message auth(proto 0 27 bytes epoch 1) v1 and sending client
elsewhere

2013-05-10 14:54:28.841515 7f074aaff700  1 mon.a@0(probing) e1
discarding message auth(proto 0 27 bytes epoch 1) v1 and sending client
elsewhere

2013-05-10 14:54:28.841526 7f074aaff700  1 mon.a@0(probing) e1
discarding message auth(proto 0 27 bytes epoch 1) v1 and sending client
elsewhere

2013-05-10 14:54:28.841540 7f074aaff700  1 mon.a@0(probing) e1
discarding message auth(proto 0 27 bytes epoch 1) v1 and sending client
elsewhere

2013-05-10 14:54:28.841549 7f074aaff700  1 mon.a@0(probing) e1
discarding message auth(proto 0 27 bytes epoch 1) v1 and sending client
elsewhere

2013-05-10 14:54:28.841556 7f074aaff700  1 mon.a@0(probing) e1
discarding message auth(proto 0 48 bytes epoch 1) v1 and sending client
elsewhere

2013-05-10 14:54:28.841567 7f074aaff700  1 mon.a@0(probing) e1
discarding message auth(proto 0 27 bytes epoch 1) v1 and sending client
elsewhere

2013-05-10 14:54:28.841578 7f074aaff700  1 mon.a@0(probing) e1
discarding message auth(proto 0 27 bytes epoch 1) v1 and sending client
elsewhere

2013-05-10 14:54:28.841585 7f074aaff700  1 mon.a@0(probing) e1
discarding message auth(proto 0 27 bytes epoch 1) v1 and sending client
elsewhere

….

Nelson Jeppesen



_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



--
Joao Eduardo Luis
Software Engineer | http://inktank.com | http://ceph.com