Is anyone able to offer any advice on how to fix this?
I've tried re-injecting the monmap into mon03 as that was mentioned in the mon troubleshooting docs, but that has not helped at all. mon03 is still stuck in the same electing state :(

I've increased the debug level on mon03 and it is reporting the following, repeatedly:

2014-09-18 10:22:12.788061 7f30f9818700 5 mon.ceph-mon-03 at 2(electing).elector(947) start -- can i be leader?
2014-09-18 10:22:12.788105 7f30f9818700 1 mon.ceph-mon-03 at 2(electing).elector(947) init, last seen epoch 947
2014-09-18 10:22:12.788111 7f30f9818700 1 -- 10.1.1.66:6789/0 --> mon.0 10.1.1.64:6789/0 -- election(XXX propose 947) v5 -- ?+0 0x7f3104568dc0
2014-09-18 10:22:12.788129 7f30f9818700 1 -- 10.1.1.66:6789/0 --> mon.1 10.1.1.65:6789/0 -- election(XXX propose 947) v5 -- ?+0 0x7f3104568b00
2014-09-18 10:22:14.470715 7f30f7f14700 1 -- 10.1.1.66:6789/0 >> :/0 pipe(0x7f31020a5c00 sd=13 :6789 s=0 pgs=0 cs=0 l=0 c=0x7f31036be7e0).accept sd=13 10.1.1.10:50568/0
2014-09-18 10:22:14.470926 7f30f7f14700 10 mon.ceph-mon-03 at 2(electing) e3 ms_verify_authorizer 10.1.1.10:0/1007970 client protocol 0
2014-09-18 10:22:14.471281 7f30f9017700 1 -- 10.1.1.66:6789/0 <== client.? 10.1.1.10:0/1007970 1 ==== auth(proto 0 30 bytes epoch 0) v1 ==== 60+0+0 (673663173 0 0) 0x7f310282d600 con 0x7f31036be7e0
2014-09-18 10:22:14.471296 7f30f9017700 5 mon.ceph-mon-03 at 2(electing) e3 waitlisting message auth(proto 0 30 bytes epoch 0) v1
2014-09-18 10:22:14.866689 7f30f9818700 5 mon.ceph-mon-03 at 2(electing) e3 waitlisting message auth(proto 0 30 bytes epoch 0) v1
2014-09-18 10:22:17.470417 7f30f9017700 10 mon.ceph-mon-03 at 2(electing) e3 ms_handle_reset 0x7f31036be7e0 10.1.1.10:0/1007970
2014-09-18 10:22:17.788184 7f30f9818700 5 mon.ceph-mon-03 at 2(electing).elector(947) election timer expired

J

On 17 September 2014 17:05, James Eckersall <james.eckersall at gmail.com> wrote:

> Hi,
>
> Now I feel dumb for jumping to the conclusion that it was a simple
> networking issue - it isn't.
> I've just checked connectivity properly and I can ping and telnet 6789
> from all mon servers to all other mon servers.
>
> I've just restarted the mon03 service and the log is showing the following:
>
> 2014-09-17 16:49:02.355148 7f7ef9f8c800 0 starting mon.ceph-mon-03 rank 2
> at 10.1.1.66:6789/0 mon_data /var/lib/ceph/mon/ceph-ceph-mon-03 fsid
> 74069c87-b361-4bb8-8ce8-6ae9deb8a9bd
> 2014-09-17 16:49:02.355375 7f7ef9f8c800 1 mon.ceph-mon-03 at -1(probing) e2
> preinit fsid 74069c87-b361-4bb8-8ce8-6ae9deb8a9bd
> 2014-09-17 16:49:02.356347 7f7ef9f8c800 1 mon.ceph-mon-03 at -1(probing).paxosservice(pgmap
> 18241250..18241952) refresh upgraded, format 0 -> 1
> 2014-09-17 16:49:02.356360 7f7ef9f8c800 1 mon.ceph-mon-03 at -1(probing).pg
> v0 on_upgrade discarding in-core PGMap
> 2014-09-17 16:49:02.400316 7f7ef9f8c800 0 mon.ceph-mon-03 at -1(probing).mds
> e1 print_map
> epoch 1
> flags 0
> created 2013-12-09 10:19:58.534310
> modified 2013-12-09 10:19:58.534332
> tableserver 0
> root 0
> session_timeout 60
> session_autoclose 300
> max_file_size 1099511627776
> last_failure 0
> last_failure_osd_epoch 0
> compat compat={},rocompat={},incompat={1=base v0.20,2=client
writeable > ranges,3=default file layouts on dirs,4=dir inode in separate object,5=mds > uses versioned encoding} > max_mds 1 > in > up {} > failed > stopped > data_pools 0 > metadata_pool 1 > inline_data disabled > > 2014-09-17 16:49:02.402373 7f7ef9f8c800 0 mon.ceph-mon-03 at -1(probing).osd > e49212 crush map has features 1107558400, adjusting msgr requires > 2014-09-17 16:49:02.402384 7f7ef9f8c800 0 mon.ceph-mon-03 at -1(probing).osd > e49212 crush map has features 1107558400, adjusting msgr requires > 2014-09-17 16:49:02.402386 7f7ef9f8c800 0 mon.ceph-mon-03 at -1(probing).osd > e49212 crush map has features 1107558400, adjusting msgr requires > 2014-09-17 16:49:02.402388 7f7ef9f8c800 0 mon.ceph-mon-03 at -1(probing).osd > e49212 crush map has features 1107558400, adjusting msgr requires > 2014-09-17 16:49:02.403725 7f7ef9f8c800 1 mon.ceph-mon-03 at -1(probing).paxosservice(auth > 26001..26154) refresh upgraded, format 0 -> 1 > 2014-09-17 16:49:02.404834 7f7ef9f8c800 0 mon.ceph-mon-03 at -1(probing) e2 > my rank is now 2 (was -1) > 2014-09-17 16:49:02.407439 7f7ef331b700 1 mon.ceph-mon-03 at 2(synchronizing) > e2 sync_obtain_latest_monmap > 2014-09-17 16:49:02.407588 7f7ef331b700 1 mon.ceph-mon-03 at 2(synchronizing) > e2 sync_obtain_latest_monmap obtained monmap e2 > 2014-09-17 16:49:09.514365 7f7ef331b700 0 log [INF] : mon.ceph-mon-03 > calling new monitor election > 2014-09-17 16:49:09.514523 7f7ef331b700 1 mon.ceph-mon-03 at 2(electing).elector(931) > init, last seen epoch 931 > 2014-09-17 16:49:09.514658 7f7ef331b700 1 mon.ceph-mon-03 at 2(electing).paxos(paxos > recovering c 31223899..31224482) is_readable now=2014-09-17 16:49:09.514659 > lease_expire=0.000000 has v0 lc 31224482 > 2014-09-17 16:49:09.514665 7f7ef331b700 1 mon.ceph-mon-03 at 2(electing).paxos(paxos > recovering c 31223899..31224482) is_readable now=2014-09-17 16:49:09.514666 > lease_expire=0.000000 has v0 lc 31224482 > 2014-09-17 16:49:15.533876 7f7ef3b1c700 1 mon.ceph-mon-03 at 
2(electing).elector(933) > init, last seen epoch 933 > 2014-09-17 16:49:21.578269 7f7ef3b1c700 1 mon.ceph-mon-03 at 2(electing).elector(935) > init, last seen epoch 935 > 2014-09-17 16:49:26.578526 7f7ef3b1c700 1 mon.ceph-mon-03 at 2(electing).elector(935) > init, last seen epoch 935 > 2014-09-17 16:49:31.578790 7f7ef3b1c700 1 mon.ceph-mon-03 at 2(electing).elector(935) > init, last seen epoch 935 > 2014-09-17 16:49:36.579044 7f7ef3b1c700 1 mon.ceph-mon-03 at 2(electing).elector(935) > init, last seen epoch 935 > > > The last lines about "electing" repeat forever. The other mons are > logging far more entries than I have seen them log before. They look like > the following (note the timestamps - all of these log lines are from just a > 2 second period): > > 2014-09-17 16:55:10.019407 7fd5a479a700 1 mon.ceph-mon-02 at 1(peon).paxos(paxos > active c 31224401..31225038) is_readable now=2014-09-17 16:55:10.019408 > lease_expire=2014-09-17 16:55:14.518716 has v0 lc 31225038 > 2014-09-17 16:55:10.019418 7fd5a479a700 1 mon.ceph-mon-02 at 1(peon).paxos(paxos > active c 31224401..31225038) is_readable now=2014-09-17 16:55:10.019418 > lease_expire=2014-09-17 16:55:14.518716 has v0 lc 31225038 > 2014-09-17 16:55:10.180220 7fd5a479a700 1 mon.ceph-mon-02 at 1(peon).paxos(paxos > active c 31224401..31225038) is_readable now=2014-09-17 16:55:10.180222 > lease_expire=2014-09-17 16:55:14.518716 has v0 lc 31225038 > 2014-09-17 16:55:10.180233 7fd5a479a700 1 mon.ceph-mon-02 at 1(peon).paxos(paxos > active c 31224401..31225038) is_readable now=2014-09-17 16:55:10.180234 > lease_expire=2014-09-17 16:55:14.518716 has v0 lc 31225038 > 2014-09-17 16:55:10.192668 7fd5a479a700 1 mon.ceph-mon-02 at 1(peon).paxos(paxos > active c 31224401..31225038) is_readable now=2014-09-17 16:55:10.192670 > lease_expire=2014-09-17 16:55:14.518716 has v0 lc 31225038 > 2014-09-17 16:55:10.192691 7fd5a479a700 1 mon.ceph-mon-02 at 1(peon).paxos(paxos > active c 31224401..31225038) is_readable now=2014-09-17 
16:55:10.192692 > lease_expire=2014-09-17 16:55:14.518716 has v0 lc 31225038 > 2014-09-17 16:55:10.276726 7fd5a479a700 1 mon.ceph-mon-02 at 1(peon).paxos(paxos > active c 31224401..31225038) is_readable now=2014-09-17 16:55:10.276727 > lease_expire=2014-09-17 16:55:14.518716 has v0 lc 31225038 > 2014-09-17 16:55:10.276737 7fd5a479a700 1 mon.ceph-mon-02 at 1(peon).paxos(paxos > active c 31224401..31225038) is_readable now=2014-09-17 16:55:10.276737 > lease_expire=2014-09-17 16:55:14.518716 has v0 lc 31225038 > 2014-09-17 16:55:10.302638 7fd5a479a700 1 mon.ceph-mon-02 at 1(peon).paxos(paxos > active c 31224401..31225038) is_readable now=2014-09-17 16:55:10.302640 > lease_expire=2014-09-17 16:55:14.518716 has v0 lc 31225038 > 2014-09-17 16:55:10.302651 7fd5a479a700 1 mon.ceph-mon-02 at 1(peon).paxos(paxos > active c 31224401..31225038) is_readable now=2014-09-17 16:55:10.302652 > lease_expire=2014-09-17 16:55:14.518716 has v0 lc 31225038 > 2014-09-17 16:55:10.362642 7fd5a479a700 1 mon.ceph-mon-02 at 1(peon).paxos(paxos > active c 31224401..31225038) is_readable now=2014-09-17 16:55:10.362643 > lease_expire=2014-09-17 16:55:14.518716 has v0 lc 31225038 > 2014-09-17 16:55:10.362655 7fd5a479a700 1 mon.ceph-mon-02 at 1(peon).paxos(paxos > active c 31224401..31225038) is_readable now=2014-09-17 16:55:10.362656 > lease_expire=2014-09-17 16:55:14.518716 has v0 lc 31225038 > 2014-09-17 16:55:10.385686 7fd5a479a700 1 mon.ceph-mon-02 at 1(peon).paxos(paxos > active c 31224401..31225038) is_readable now=2014-09-17 16:55:10.385687 > lease_expire=2014-09-17 16:55:14.518716 has v0 lc 31225038 > 2014-09-17 16:55:10.385697 7fd5a479a700 1 mon.ceph-mon-02 at 1(peon).paxos(paxos > active c 31224401..31225038) is_readable now=2014-09-17 16:55:10.385697 > lease_expire=2014-09-17 16:55:14.518716 has v0 lc 31225038 > 2014-09-17 16:55:10.406712 7fd5a479a700 1 mon.ceph-mon-02 at 1(peon).paxos(paxos > active c 31224401..31225038) is_readable now=2014-09-17 16:55:10.406713 > 
lease_expire=2014-09-17 16:55:14.518716 has v0 lc 31225038 > 2014-09-17 16:55:10.406723 7fd5a479a700 1 mon.ceph-mon-02 at 1(peon).paxos(paxos > active c 31224401..31225038) is_readable now=2014-09-17 16:55:10.406724 > lease_expire=2014-09-17 16:55:14.518716 has v0 lc 31225038 > 2014-09-17 16:55:10.423277 7fd5a479a700 1 mon.ceph-mon-02 at 1(peon).paxos(paxos > active c 31224401..31225038) is_readable now=2014-09-17 16:55:10.423279 > lease_expire=2014-09-17 16:55:14.518716 has v0 lc 31225038 > 2014-09-17 16:55:10.423299 7fd5a479a700 1 mon.ceph-mon-02 at 1(peon).paxos(paxos > active c 31224401..31225038) is_readable now=2014-09-17 16:55:10.423300 > lease_expire=2014-09-17 16:55:14.518716 has v0 lc 31225038 > 2014-09-17 16:55:10.543138 7fd5a479a700 1 mon.ceph-mon-02 at 1(peon).paxos(paxos > updating c 31224401..31225038) is_readable now=2014-09-17 16:55:10.543139 > lease_expire=0.000000 has v0 lc 31225038 > 2014-09-17 16:55:10.543145 7fd5a479a700 1 mon.ceph-mon-02 at 1(peon).paxos(paxos > updating c 31224401..31225038) is_readable now=2014-09-17 16:55:10.543145 > lease_expire=0.000000 has v0 lc 31225038 > 2014-09-17 16:55:10.580911 7fd5a479a700 1 mon.ceph-mon-02 at 1(peon).paxos(paxos > active c 31224401..31225039) is_readable now=2014-09-17 16:55:10.580912 > lease_expire=2014-09-17 16:55:15.549947 has v0 lc 31225039 > 2014-09-17 16:55:10.580922 7fd5a479a700 1 mon.ceph-mon-02 at 1(peon).paxos(paxos > active c 31224401..31225039) is_readable now=2014-09-17 16:55:10.580923 > lease_expire=2014-09-17 16:55:15.549947 has v0 lc 31225039 > 2014-09-17 16:55:10.580930 7fd5a479a700 1 mon.ceph-mon-02 at 1(peon).paxos(paxos > active c 31224401..31225039) is_readable now=2014-09-17 16:55:10.580930 > lease_expire=2014-09-17 16:55:15.549947 has v0 lc 31225039 > 2014-09-17 16:55:10.606130 7fd5a479a700 1 mon.ceph-mon-02 at 1(peon).paxos(paxos > updating c 31224401..31225039) is_readable now=2014-09-17 16:55:10.606131 > lease_expire=0.000000 has v0 lc 31225039 > 2014-09-17 
16:55:10.606136 7fd5a479a700 1 mon.ceph-mon-02 at 1(peon).paxos(paxos > updating c 31224401..31225039) is_readable now=2014-09-17 16:55:10.606137 > lease_expire=0.000000 has v0 lc 31225039 > 2014-09-17 16:55:10.633460 7fd5a479a700 1 mon.ceph-mon-02 at 1(peon).log > v12645471 check_sub sending message to client.2190462 10.1.1.10:0/1004032 > with 1 entries (version 12645471) > 2014-09-17 16:55:10.633632 7fd5a479a700 1 mon.ceph-mon-02 at 1(peon).paxos(paxos > active c 31224401..31225040) is_readable now=2014-09-17 16:55:10.633633 > lease_expire=2014-09-17 16:55:15.607320 has v0 lc 31225040 > 2014-09-17 16:55:10.633646 7fd5a479a700 1 mon.ceph-mon-02 at 1(peon).paxos(paxos > active c 31224401..31225040) is_readable now=2014-09-17 16:55:10.633651 > lease_expire=2014-09-17 16:55:15.607320 has v0 lc 31225040 > 2014-09-17 16:55:10.633657 7fd5a479a700 1 mon.ceph-mon-02 at 1(peon).paxos(paxos > active c 31224401..31225040) is_readable now=2014-09-17 16:55:10.633658 > lease_expire=2014-09-17 16:55:15.607320 has v0 lc 31225040 > 2014-09-17 16:55:10.633699 7fd5a479a700 1 mon.ceph-mon-02 at 1(peon).paxos(paxos > active c 31224401..31225040) is_readable now=2014-09-17 16:55:10.633700 > lease_expire=2014-09-17 16:55:15.607320 has v0 lc 31225040 > 2014-09-17 16:55:10.633707 7fd5a479a700 1 mon.ceph-mon-02 at 1(peon).paxos(paxos > active c 31224401..31225040) is_readable now=2014-09-17 16:55:10.633707 > lease_expire=2014-09-17 16:55:15.607320 has v0 lc 31225040 > 2014-09-17 16:55:10.695127 7fd5a479a700 1 mon.ceph-mon-02 at 1(peon).paxos(paxos > active c 31224401..31225040) is_readable now=2014-09-17 16:55:10.695129 > lease_expire=2014-09-17 16:55:15.607320 has v0 lc 31225040 > 2014-09-17 16:55:10.695151 7fd5a479a700 1 mon.ceph-mon-02 at 1(peon).paxos(paxos > active c 31224401..31225040) is_readable now=2014-09-17 16:55:10.695152 > lease_expire=2014-09-17 16:55:15.607320 has v0 lc 31225040 > 2014-09-17 16:55:10.800013 7fd5a479a700 1 mon.ceph-mon-02 at 1(peon).paxos(paxos > active c 
31224401..31225040) is_readable now=2014-09-17 16:55:10.800015 > lease_expire=2014-09-17 16:55:15.607320 has v0 lc 31225040 > 2014-09-17 16:55:10.800030 7fd5a479a700 1 mon.ceph-mon-02 at 1(peon).paxos(paxos > active c 31224401..31225040) is_readable now=2014-09-17 16:55:10.800031 > lease_expire=2014-09-17 16:55:15.607320 has v0 lc 31225040 > 2014-09-17 16:55:10.830432 7fd5a479a700 1 mon.ceph-mon-02 at 1(peon).paxos(paxos > active c 31224401..31225040) is_readable now=2014-09-17 16:55:10.830433 > lease_expire=2014-09-17 16:55:15.607320 has v0 lc 31225040 > 2014-09-17 16:55:10.830441 7fd5a479a700 1 mon.ceph-mon-02 at 1(peon).paxos(paxos > active c 31224401..31225040) is_readable now=2014-09-17 16:55:10.830442 > lease_expire=2014-09-17 16:55:15.607320 has v0 lc 31225040 > 2014-09-17 16:55:10.848954 7fd5a479a700 1 mon.ceph-mon-02 at 1(peon).paxos(paxos > active c 31224401..31225040) is_readable now=2014-09-17 16:55:10.848956 > lease_expire=2014-09-17 16:55:15.607320 has v0 lc 31225040 > 2014-09-17 16:55:10.848964 7fd5a479a700 1 mon.ceph-mon-02 at 1(peon).paxos(paxos > active c 31224401..31225040) is_readable now=2014-09-17 16:55:10.848965 > lease_expire=2014-09-17 16:55:15.607320 has v0 lc 31225040 > 2014-09-17 16:55:10.887139 7fd5a479a700 1 mon.ceph-mon-02 at 1(peon).paxos(paxos > active c 31224401..31225040) is_readable now=2014-09-17 16:55:10.887140 > lease_expire=2014-09-17 16:55:15.607320 has v0 lc 31225040 > 2014-09-17 16:55:10.887150 7fd5a479a700 1 mon.ceph-mon-02 at 1(peon).paxos(paxos > active c 31224401..31225040) is_readable now=2014-09-17 16:55:10.887151 > lease_expire=2014-09-17 16:55:15.607320 has v0 lc 31225040 > 2014-09-17 16:55:10.913825 7fd5a479a700 1 mon.ceph-mon-02 at 1(peon).paxos(paxos > active c 31224401..31225040) is_readable now=2014-09-17 16:55:10.913827 > lease_expire=2014-09-17 16:55:15.607320 has v0 lc 31225040 > 2014-09-17 16:55:10.913834 7fd5a479a700 1 mon.ceph-mon-02 at 1(peon).paxos(paxos > active c 31224401..31225040) is_readable 
now=2014-09-17 16:55:10.913835 > lease_expire=2014-09-17 16:55:15.607320 has v0 lc 31225040 > 2014-09-17 16:55:11.010277 7fd5a479a700 1 mon.ceph-mon-02 at 1(peon).paxos(paxos > active c 31224401..31225040) is_readable now=2014-09-17 16:55:11.010279 > lease_expire=2014-09-17 16:55:15.607320 has v0 lc 31225040 > 2014-09-17 16:55:11.010287 7fd5a479a700 1 mon.ceph-mon-02 at 1(peon).paxos(paxos > active c 31224401..31225040) is_readable now=2014-09-17 16:55:11.010288 > lease_expire=2014-09-17 16:55:15.607320 has v0 lc 31225040 > 2014-09-17 16:55:11.098312 7fd5a479a700 1 mon.ceph-mon-02 at 1(peon).paxos(paxos > active c 31224401..31225040) is_readable now=2014-09-17 16:55:11.098314 > lease_expire=2014-09-17 16:55:15.607320 has v0 lc 31225040 > 2014-09-17 16:55:11.098325 7fd5a479a700 1 mon.ceph-mon-02 at 1(peon).paxos(paxos > active c 31224401..31225040) is_readable now=2014-09-17 16:55:11.098326 > lease_expire=2014-09-17 16:55:15.607320 has v0 lc 31225040 > 2014-09-17 16:55:11.109040 7fd5a479a700 1 mon.ceph-mon-02 at 1(peon).paxos(paxos > active c 31224401..31225040) is_readable now=2014-09-17 16:55:11.109042 > lease_expire=2014-09-17 16:55:15.607320 has v0 lc 31225040 > 2014-09-17 16:55:11.109053 7fd5a479a700 1 mon.ceph-mon-02 at 1(peon).paxos(paxos > active c 31224401..31225040) is_readable now=2014-09-17 16:55:11.109054 > lease_expire=2014-09-17 16:55:15.607320 has v0 lc 31225040 > 2014-09-17 16:55:11.170705 7fd5a479a700 1 mon.ceph-mon-02 at 1(peon).paxos(paxos > active c 31224401..31225040) is_readable now=2014-09-17 16:55:11.170706 > lease_expire=2014-09-17 16:55:15.607320 has v0 lc 31225040 > 2014-09-17 16:55:11.170713 7fd5a479a700 1 mon.ceph-mon-02 at 1(peon).paxos(paxos > active c 31224401..31225040) is_readable now=2014-09-17 16:55:11.170714 > lease_expire=2014-09-17 16:55:15.607320 has v0 lc 31225040 > 2014-09-17 16:55:11.222537 7fd5a479a700 1 mon.ceph-mon-02 at 1(peon).paxos(paxos > active c 31224401..31225040) is_readable now=2014-09-17 16:55:11.222539 > 
lease_expire=2014-09-17 16:55:15.607320 has v0 lc 31225040 > 2014-09-17 16:55:11.222549 7fd5a479a700 1 mon.ceph-mon-02 at 1(peon).paxos(paxos > active c 31224401..31225040) is_readable now=2014-09-17 16:55:11.222550 > lease_expire=2014-09-17 16:55:15.607320 has v0 lc 31225040 > 2014-09-17 16:55:11.431510 7fd5a479a700 1 mon.ceph-mon-02 at 1(peon).paxos(paxos > active c 31224401..31225040) is_readable now=2014-09-17 16:55:11.431511 > lease_expire=2014-09-17 16:55:15.607320 has v0 lc 31225040 > 2014-09-17 16:55:11.431524 7fd5a479a700 1 mon.ceph-mon-02 at 1(peon).paxos(paxos > active c 31224401..31225040) is_readable now=2014-09-17 16:55:11.431525 > lease_expire=2014-09-17 16:55:15.607320 has v0 lc 31225040 > 2014-09-17 16:55:11.453664 7fd5a479a700 1 mon.ceph-mon-02 at 1(peon).paxos(paxos > active c 31224401..31225040) is_readable now=2014-09-17 16:55:11.453666 > lease_expire=2014-09-17 16:55:15.607320 has v0 lc 31225040 > 2014-09-17 16:55:11.453685 7fd5a479a700 1 mon.ceph-mon-02 at 1(peon).paxos(paxos > active c 31224401..31225040) is_readable now=2014-09-17 16:55:11.453687 > lease_expire=2014-09-17 16:55:15.607320 has v0 lc 31225040 > 2014-09-17 16:55:11.520250 7fd5a479a700 1 mon.ceph-mon-02 at 1(peon).paxos(paxos > active c 31224401..31225040) is_readable now=2014-09-17 16:55:11.520252 > lease_expire=2014-09-17 16:55:15.607320 has v0 lc 31225040 > 2014-09-17 16:55:11.520263 7fd5a479a700 1 mon.ceph-mon-02 at 1(peon).paxos(paxos > active c 31224401..31225040) is_readable now=2014-09-17 16:55:11.520264 > lease_expire=2014-09-17 16:55:15.607320 has v0 lc 31225040 > 2014-09-17 16:55:11.603991 7fd5a479a700 1 mon.ceph-mon-02 at 1(peon).paxos(paxos > active c 31224401..31225041) is_readable now=2014-09-17 16:55:11.603992 > lease_expire=2014-09-17 16:55:16.575181 has v0 lc 31225041 > 2014-09-17 16:55:11.610948 7fd5a479a700 1 mon.ceph-mon-02 at 1(peon).paxos(paxos > active c 31224401..31225041) is_readable now=2014-09-17 16:55:11.610949 > lease_expire=2014-09-17 
16:55:16.575181 has v0 lc 31225041 > 2014-09-17 16:55:11.610965 7fd5a479a700 1 mon.ceph-mon-02 at 1(peon).paxos(paxos > active c 31224401..31225041) is_readable now=2014-09-17 16:55:11.610966 > lease_expire=2014-09-17 16:55:16.575181 has v0 lc 31225041 > 2014-09-17 16:55:11.622479 7fd5a479a700 1 mon.ceph-mon-02 at 1(peon).paxos(paxos > active c 31224401..31225041) is_readable now=2014-09-17 16:55:11.622480 > lease_expire=2014-09-17 16:55:16.575181 has v0 lc 31225041 > 2014-09-17 16:55:11.622495 7fd5a479a700 1 mon.ceph-mon-02 at 1(peon).paxos(paxos > active c 31224401..31225041) is_readable now=2014-09-17 16:55:11.622496 > lease_expire=2014-09-17 16:55:16.575181 has v0 lc 31225041 > 2014-09-17 16:55:11.787013 7fd5a479a700 1 mon.ceph-mon-02 at 1(peon).paxos(paxos > active c 31224401..31225041) is_readable now=2014-09-17 16:55:11.787014 > lease_expire=2014-09-17 16:55:16.575181 has v0 lc 31225041 > 2014-09-17 16:55:11.787024 7fd5a479a700 1 mon.ceph-mon-02 at 1(peon).paxos(paxos > active c 31224401..31225041) is_readable now=2014-09-17 16:55:11.787025 > lease_expire=2014-09-17 16:55:16.575181 has v0 lc 31225041 > 2014-09-17 16:55:11.873613 7fd5a479a700 1 mon.ceph-mon-02 at 1(peon).paxos(paxos > active c 31224401..31225041) is_readable now=2014-09-17 16:55:11.873614 > lease_expire=2014-09-17 16:55:16.575181 has v0 lc 31225041 > 2014-09-17 16:55:11.873627 7fd5a479a700 1 mon.ceph-mon-02 at 1(peon).paxos(paxos > active c 31224401..31225041) is_readable now=2014-09-17 16:55:11.873628 > lease_expire=2014-09-17 16:55:16.575181 has v0 lc 31225041 > 2014-09-17 16:55:11.988465 7fd5a479a700 1 mon.ceph-mon-02 at 1(peon).paxos(paxos > active c 31224401..31225041) is_readable now=2014-09-17 16:55:11.988467 > lease_expire=2014-09-17 16:55:16.575181 has v0 lc 31225041 > 2014-09-17 16:55:11.988487 7fd5a479a700 1 mon.ceph-mon-02 at 1(peon).paxos(paxos > active c 31224401..31225041) is_readable now=2014-09-17 16:55:11.988489 > lease_expire=2014-09-17 16:55:16.575181 has v0 lc 31225041 > 
>
> I'm wondering at this point whether I should just reinject the monmap from
> mon01 or mon02 into mon03 or whether there is something else that can be
> done to fix this.
>
> With hindsight, I would have stopped the mon service before relocating the
> nic cable, but I expected the mon to survive a short network outage which
> it doesn't seem to have done :(
>
>
> On 17 September 2014 16:21, James Eckersall <james.eckersall at gmail.com>
> wrote:
>
>> Hi,
>>
>> Thanks for the advice.
>>
>> I feel pretty dumb as it does indeed look like a simple networking issue.
>> You know how you check things 5 times and miss the most obvious one...
>>
>> J
>>
>> On 17 September 2014 16:04, Florian Haas <florian at hastexo.com> wrote:
>>
>>> On Wed, Sep 17, 2014 at 1:58 PM, James Eckersall
>>> <james.eckersall at gmail.com> wrote:
>>> > Hi,
>>> >
>>> > I have a ceph cluster running 0.80.1 on Ubuntu 14.04. I have 3 monitors and
>>> > 4 OSD nodes currently.
>>> >
>>> > Everything has been running great up until today where I've got an issue
>>> > with the monitors.
>>> > I moved mon03 to a different switchport so it would have temporarily lost
>>> > connectivity.
>>> > Since then, the cluster is reporting that that mon is down, although it's
>>> > definitely up.
>>> > I've tried restarting the mon services on all three mons, but that hasn't
>>> > made a difference.
>>> > I definitely, 100% do not have any clock skew on any of the mons. This has
>>> > been triple-checked as the ceph docs seem to suggest that might be the cause
>>> > of this issue.
>>> >
>>> > Here is what ceph -s and ceph health detail are reporting as well as the
>>> > mon_status for each monitor:
>>> >
>>> >
>>> > # ceph -s ; ceph health detail
>>> > cluster XXX
>>> > health HEALTH_WARN 1 mons down, quorum 0,1 ceph-mon-01,ceph-mon-02
>>> > monmap e2: 3 mons at {ceph-mon-01=10.1.1.64:6789/0,ceph-mon-02=10.1.1.65:6789/0,ceph-mon-03=10.1.1.66:6789/0}, election epoch 932, quorum 0,1 ceph-mon-01,ceph-mon-02
>>> > osdmap e49213: 80 osds: 80 up, 80 in
>>> > pgmap v18242952: 4864 pgs, 5 pools, 69910 GB data, 17638 kobjects
>>> > 197 TB used, 95904 GB / 290 TB avail
>>> > 8 active+clean+scrubbing+deep
>>> > 4856 active+clean
>>> > client io 6893 kB/s rd, 5657 kB/s wr, 2090 op/s
>>> > HEALTH_WARN 1 mons down, quorum 0,1 ceph-mon-01,ceph-mon-02
>>> > mon.ceph-mon-03 (rank 2) addr 10.1.1.66:6789/0 is down (out of quorum)
>>> >
>>> >
>>> > { "name": "ceph-mon-01",
>>> >   "rank": 0,
>>> >   "state": "leader",
>>> >   "election_epoch": 932,
>>> >   "quorum": [
>>> >         0,
>>> >         1],
>>> >   "outside_quorum": [],
>>> >   "extra_probe_peers": [],
>>> >   "sync_provider": [],
>>> >   "monmap": { "epoch": 2,
>>> >       "fsid": "XXX",
>>> >       "modified": "0.000000",
>>> >       "created": "0.000000",
>>> >       "mons": [
>>> >             { "rank": 0,
>>> >               "name": "ceph-mon-01",
>>> >               "addr": "10.1.1.64:6789\/0"},
>>> >             { "rank": 1,
>>> >               "name": "ceph-mon-02",
>>> >               "addr": "10.1.1.65:6789\/0"},
>>> >             { "rank": 2,
>>> >               "name": "ceph-mon-03",
>>> >               "addr": "10.1.1.66:6789\/0"}]}}
>>> >
>>> >
>>> > { "name": "ceph-mon-02",
>>> >   "rank": 1,
>>> >   "state": "peon",
>>> >   "election_epoch": 932,
>>> >   "quorum": [
>>> >         0,
>>> >         1],
>>> >   "outside_quorum": [],
>>> >   "extra_probe_peers": [],
>>> >   "sync_provider": [],
>>> >   "monmap": { "epoch": 2,
>>> >       "fsid": "XXX",
>>> >       "modified": "0.000000",
>>> >       "created": "0.000000",
>>> >       "mons": [
>>> >             { "rank": 0,
>>> >               "name": "ceph-mon-01",
>>> >               "addr": "10.1.1.64:6789\/0"},
>>> >             { "rank": 1,
>>> >               "name": "ceph-mon-02",
>>> >               "addr": "10.1.1.65:6789\/0"},
>>> >             { "rank": 2,
>>> >               "name": "ceph-mon-03",
>>> >               "addr": "10.1.1.66:6789\/0"}]}}
>>> >
>>> >
>>> > { "name": "ceph-mon-03",
>>> >   "rank": 2,
>>> >   "state": "electing",
>>> >   "election_epoch": 931,
>>> >   "quorum": [],
>>> >   "outside_quorum": [],
>>> >   "extra_probe_peers": [],
>>> >   "sync_provider": [],
>>> >   "monmap": { "epoch": 2,
>>> >       "fsid": "XXX",
>>> >       "modified": "0.000000",
>>> >       "created": "0.000000",
>>> >       "mons": [
>>> >             { "rank": 0,
>>> >               "name": "ceph-mon-01",
>>> >               "addr": "10.1.1.64:6789\/0"},
>>> >             { "rank": 1,
>>> >               "name": "ceph-mon-02",
>>> >               "addr": "10.1.1.65:6789\/0"},
>>> >             { "rank": 2,
>>> >               "name": "ceph-mon-03",
>>> >               "addr": "10.1.1.66:6789\/0"}]}}
>>> >
>>> >
>>> > Any help or advice is appreciated.
>>>
>>> It looks like your mon has been unable to communicate with the other
>>> hosts, presumably since the time you un-/replugged it. Check your
>>> switch port configuration. Also, make sure that from 10.1.1.66, you
>>> can not only ping 10.1.1.64 and 10.1.1.65, but make a TCP connection
>>> on port 6789. With that out of the way, check your mon log on
>>> ceph-mon-03 (in /var/log/ceph/mon); it should provide some additional
>>> insight into the problem.
>>>
>>> Cheers,
>>> Florian
>>>
>>
>>
>