On 05/10/2013 11:02 PM, Jeppesen, Nelson wrote:
After upgrading my cluster everything looked good, then I rebooted the farm and all hell broke loose. I have 3 monitors but none are able to start. On all of them the '/usr/bin/python /usr/sbin/ceph-create-keys' command is hanging because none of the nodes can form a quorum.
We would certainly be interested in taking a look at logs from those monitors, and would appreciate it if you could set 'debug mon = 20', 'debug auth = 10' and 'debug ms = 1', and give them a spin until you hit your issue.
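For reference, a minimal sketch of how those three settings might look in ceph.conf. Putting them under [mon] is an assumption here (they could equally go under [global]); the values are the ones named above, and the monitors read the file when they start.

```ini
[mon]
    ; verbose monitor, auth and messenger logging for this investigation
    debug mon = 20
    debug auth = 10
    debug ms = 1
```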
All ceph tools are producing the following fault:

# ceph -w
2013-05-10 15:00:55.259382 7f6b68e0e700 0 -- :/20337 >> 10.1.1.21:6789/0 pipe(0x2fdc520 sd=4 :0 s=1 pgs=0 cs=0 l=1).fault
….

Using monmaptool I removed all but one monitor, did the same to ceph.conf, then tried running the monitor interactively and got the following:
Did you inject the monmap? It seems as if the monitor is still attempting to probe for the remaining monitors in the monmap, so that would be an indicator that although you changed the monmap, the monitor still sees the older map (which means the newer map wasn't injected).
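A rough way to check this from the monitor's own output is to list the distinct peers it fails to connect to. The sketch below only pattern-matches the messenger '.fault' lines; the function name fault_peers and the log path in the usage note are illustrative assumptions, not part of any Ceph tool.

```shell
# List the distinct peer addresses that appear in messenger ".fault"
# lines, reading ceph-mon log text on stdin.
# Assumes a grep that supports -o (GNU and BSD grep both do).
fault_peers() {
    grep -oE '>> [0-9.]+:[0-9]+/[0-9]+ pipe\([^)]*\)\.fault' \
        | awk '{ print $2 }' \
        | sort -u
}
```

For example, 'fault_peers < /var/log/ceph/ceph-mon.a.log' (path assumed). If addresses of monitors you removed from the monmap still show up, the monitor is most likely still working from the old map.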
Just in case, you can inject the monmap by running 'ceph-mon -i a --inject-monmap <monmap.file>'. You must shut down the monitor before injecting the monmap.
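As a sketch of that sequence, assuming the edited map was saved to /tmp/monmap and the monitor id is 'a': print the map first to confirm which monitors it contains, then inject it. MON_ID, MONMAP, DRY_RUN and the helper names are hypothetical, added only for illustration; stopping and restarting the daemon is left to however your system manages it.

```shell
#!/bin/sh
# Sketch only: verify an edited monmap, then inject it into a stopped monitor.
# MON_ID, MONMAP, DRY_RUN and the helper names are illustrative assumptions.
MON_ID="${MON_ID:-a}"
MONMAP="${MONMAP:-/tmp/monmap}"

# Print the command when DRY_RUN=1, otherwise execute it.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

inject_monmap() {
    # Show which monitors the edited map contains before committing to it.
    run monmaptool --print "$MONMAP" || return 1
    # The monitor daemon must not be running at this point.
    run ceph-mon -i "$MON_ID" --inject-monmap "$MONMAP"
}
```

Setting DRY_RUN=1 before calling inject_monmap prints the two commands without touching the monitor store, which is a cheap way to double-check the id and file path first.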
-Joao
Here's the mon output:

# /usr/bin/ceph-mon -i a --pid-file /var/run/ceph/mon.a.pid -c /etc/ceph/ceph.conf -d
2013-05-10 14:54:23.405324 7f0750a61780 0 ceph version 0.61 (237f3f1e8d8c3b85666529860285dcdffdeda4c5), process ceph-mon, pid 29289
starting mon.a rank 0 at 10.1.1.21:6789/0 mon_data /var/lib/ceph/mon/ceph-a fsid 969f28c3-5ee1-4451-9b5b-97c52b724a06
2013-05-10 14:54:23.455975 7f0750a61780 1 mon.a@-1(probing) e1 preinit fsid 969f28c3-5ee1-4451-9b5b-97c52b724a06
2013-05-10 14:54:23.820160 7f0750a61780 1 mon.a@-1(probing).osd e6666 e6666: 96 osds: 96 up, 96 in
2013-05-10 14:54:23.820372 7f0750a61780 1 mon.a@-1(probing).osd e6667 e6667: 96 osds: 96 up, 96 in
2013-05-10 14:54:23.820618 7f0750a61780 1 mon.a@-1(probing).osd e6668 e6668: 96 osds: 96 up, 96 in
2013-05-10 14:54:23.820802 7f0750a61780 1 mon.a@-1(probing).osd e6669 e6669: 96 osds: 96 up, 96 in
2013-05-10 14:54:23.820995 7f0750a61780 1 mon.a@-1(probing).osd e6670 e6670: 96 osds: 96 up, 96 in
2013-05-10 14:54:23.821180 7f0750a61780 1 mon.a@-1(probing).osd e6671 e6671: 96 osds: 96 up, 96 in
2013-05-10 14:54:23.821368 7f0750a61780 1 mon.a@-1(probing).osd e6672 e6672: 96 osds: 96 up, 96 in
2013-05-10 14:54:23.821549 7f0750a61780 1 mon.a@-1(probing).osd e6673 e6673: 96 osds: 96 up, 96 in
2013-05-10 14:54:23.821735 7f0750a61780 1 mon.a@-1(probing).osd e6674 e6674: 96 osds: 96 up, 96 in
2013-05-10 14:54:23.821981 7f0750a61780 1 mon.a@-1(probing).osd e6675 e6675: 96 osds: 96 up, 96 in
2013-05-10 14:54:23.822173 7f0750a61780 1 mon.a@-1(probing).osd e6676 e6676: 96 osds: 96 up, 96 in
2013-05-10 14:54:23.822353 7f0750a61780 1 mon.a@-1(probing).osd e6677 e6677: 96 osds: 96 up, 96 in
2013-05-10 14:54:23.822529 7f0750a61780 1 mon.a@-1(probing).osd e6678 e6678: 96 osds: 96 up, 96 in
2013-05-10 14:54:23.822698 7f0750a61780 1 mon.a@-1(probing).osd e6679 e6679: 96 osds: 96 up, 96 in
2013-05-10 14:54:23.822879 7f0750a61780 1 mon.a@-1(probing).osd e6680 e6680: 96 osds: 96 up, 96 in
2013-05-10 14:54:23.823056 7f0750a61780 1 mon.a@-1(probing).osd e6681 e6681: 96 osds: 96 up, 96 in
2013-05-10 14:54:23.823229 7f0750a61780 1 mon.a@-1(probing).osd e6682 e6682: 96 osds: 96 up, 96 in
2013-05-10 14:54:23.823403 7f0750a61780 1 mon.a@-1(probing).osd e6683 e6683: 96 osds: 96 up, 96 in
2013-05-10 14:54:23.823580 7f0750a61780 1 mon.a@-1(probing).osd e6684 e6684: 96 osds: 96 up, 96 in
2013-05-10 14:54:23.823749 7f0750a61780 1 mon.a@-1(probing).osd e6685 e6685: 96 osds: 96 up, 96 in
2013-05-10 14:54:23.823915 7f0750a61780 1 mon.a@-1(probing).osd e6686 e6686: 96 osds: 92 up, 96 in
2013-05-10 14:54:23.824088 7f0750a61780 1 mon.a@-1(probing).osd e6687 e6687: 96 osds: 88 up, 96 in
2013-05-10 14:54:23.824261 7f0750a61780 1 mon.a@-1(probing).osd e6688 e6688: 96 osds: 83 up, 96 in
2013-05-10 14:54:23.824434 7f0750a61780 1 mon.a@-1(probing).osd e6689 e6689: 96 osds: 71 up, 96 in
2013-05-10 14:54:23.824610 7f0750a61780 1 mon.a@-1(probing).osd e6690 e6690: 96 osds: 69 up, 96 in
2013-05-10 14:54:23.824793 7f0750a61780 1 mon.a@-1(probing).osd e6691 e6691: 96 osds: 56 up, 96 in
2013-05-10 14:54:23.838611 7f0750a61780 0 mon.a@-1(probing).osd e6691 crush map has features 33816576, adjusting msgr requires
2013-05-10 14:54:23.838630 7f0750a61780 0 mon.a@-1(probing).osd e6691 crush map has features 33816576, adjusting msgr requires
2013-05-10 14:54:23.838634 7f0750a61780 0 mon.a@-1(probing).osd e6691 crush map has features 33816576, adjusting msgr requires
2013-05-10 14:54:23.838636 7f0750a61780 0 mon.a@-1(probing).osd e6691 crush map has features 33816576, adjusting msgr requires
2013-05-10 14:54:23.841335 7f0750a61780 0 mon.a@-1(probing) e1 my rank is now 0 (was -1)
2013-05-10 14:54:23.842481 7f0748ff9700 0 -- 10.1.1.21:6789/0 >> 10.1.1.33:6789/0 pipe(0x204ba00 sd=41 :0 s=1 pgs=0 cs=0 l=0).fault
2013-05-10 14:54:23.842493 7f07490fa700 0 -- 10.1.1.21:6789/0 >> 10.1.1.22:6789/0 pipe(0x204bc80 sd=40 :0 s=1 pgs=0 cs=0 l=0).fault
2013-05-10 14:54:28.841438 7f074aaff700 1 mon.a@0(probing) e1 discarding message auth(proto 0 27 bytes epoch 1) v1 and sending client elsewhere
2013-05-10 14:54:28.841472 7f074aaff700 1 mon.a@0(probing) e1 discarding message auth(proto 0 27 bytes epoch 1) v1 and sending client elsewhere
2013-05-10 14:54:28.841483 7f074aaff700 1 mon.a@0(probing) e1 discarding message auth(proto 0 30 bytes epoch 1) v1 and sending client elsewhere
2013-05-10 14:54:28.841491 7f074aaff700 1 mon.a@0(probing) e1 discarding message auth(proto 0 27 bytes epoch 1) v1 and sending client elsewhere
2013-05-10 14:54:28.841499 7f074aaff700 1 mon.a@0(probing) e1 discarding message auth(proto 0 27 bytes epoch 1) v1 and sending client elsewhere
2013-05-10 14:54:28.841507 7f074aaff700 1 mon.a@0(probing) e1 discarding message auth(proto 0 27 bytes epoch 1) v1 and sending client elsewhere
2013-05-10 14:54:28.841515 7f074aaff700 1 mon.a@0(probing) e1 discarding message auth(proto 0 27 bytes epoch 1) v1 and sending client elsewhere
2013-05-10 14:54:28.841526 7f074aaff700 1 mon.a@0(probing) e1 discarding message auth(proto 0 27 bytes epoch 1) v1 and sending client elsewhere
2013-05-10 14:54:28.841540 7f074aaff700 1 mon.a@0(probing) e1 discarding message auth(proto 0 27 bytes epoch 1) v1 and sending client elsewhere
2013-05-10 14:54:28.841549 7f074aaff700 1 mon.a@0(probing) e1 discarding message auth(proto 0 27 bytes epoch 1) v1 and sending client elsewhere
2013-05-10 14:54:28.841556 7f074aaff700 1 mon.a@0(probing) e1 discarding message auth(proto 0 48 bytes epoch 1) v1 and sending client elsewhere
2013-05-10 14:54:28.841567 7f074aaff700 1 mon.a@0(probing) e1 discarding message auth(proto 0 27 bytes epoch 1) v1 and sending client elsewhere
2013-05-10 14:54:28.841578 7f074aaff700 1 mon.a@0(probing) e1 discarding message auth(proto 0 27 bytes epoch 1) v1 and sending client elsewhere
2013-05-10 14:54:28.841585 7f074aaff700 1 mon.a@0(probing) e1 discarding message auth(proto 0 27 bytes epoch 1) v1 and sending client elsewhere
….
Nelson Jeppesen
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
--
Joao Eduardo Luis
Software Engineer | http://inktank.com | http://ceph.com