monitor quorum

Is anyone able to offer any advice on how to fix this?
I've tried re-injecting the monmap into mon03 as that was mentioned in the
mon troubleshooting docs, but that has not helped at all.  mon03 is still
stuck in the same electing state :(

I've increased the debug level on mon03 and it is reporting the following,
repeatedly:

2014-09-18 10:22:12.788061 7f30f9818700  5 mon.ceph-mon-03 at 2(electing).elector(947) start -- can i be leader?
2014-09-18 10:22:12.788105 7f30f9818700  1 mon.ceph-mon-03 at 2(electing).elector(947) init, last seen epoch 947
2014-09-18 10:22:12.788111 7f30f9818700  1 -- 10.1.1.66:6789/0 --> mon.0 10.1.1.64:6789/0 -- election(XXX propose 947) v5 -- ?+0 0x7f3104568dc0
2014-09-18 10:22:12.788129 7f30f9818700  1 -- 10.1.1.66:6789/0 --> mon.1 10.1.1.65:6789/0 -- election(XXX propose 947) v5 -- ?+0 0x7f3104568b00
2014-09-18 10:22:14.470715 7f30f7f14700  1 -- 10.1.1.66:6789/0 >> :/0 pipe(0x7f31020a5c00 sd=13 :6789 s=0 pgs=0 cs=0 l=0 c=0x7f31036be7e0).accept sd=13 10.1.1.10:50568/0
2014-09-18 10:22:14.470926 7f30f7f14700 10 mon.ceph-mon-03 at 2(electing) e3 ms_verify_authorizer 10.1.1.10:0/1007970 client protocol 0
2014-09-18 10:22:14.471281 7f30f9017700  1 -- 10.1.1.66:6789/0 <== client.? 10.1.1.10:0/1007970 1 ==== auth(proto 0 30 bytes epoch 0) v1 ==== 60+0+0 (673663173 0 0) 0x7f310282d600 con 0x7f31036be7e0
2014-09-18 10:22:14.471296 7f30f9017700  5 mon.ceph-mon-03 at 2(electing) e3 waitlisting message auth(proto 0 30 bytes epoch 0) v1
2014-09-18 10:22:14.866689 7f30f9818700  5 mon.ceph-mon-03 at 2(electing) e3 waitlisting message auth(proto 0 30 bytes epoch 0) v1

2014-09-18 10:22:17.470417 7f30f9017700 10 mon.ceph-mon-03 at 2(electing) e3 ms_handle_reset 0x7f31036be7e0 10.1.1.10:0/1007970
2014-09-18 10:22:17.788184 7f30f9818700  5 mon.ceph-mon-03 at 2(electing).elector(947) election timer expired
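
To make the "repeatedly" part easier to quantify, here is a rough Python sketch that counts how often the elector restarts at each epoch. It assumes one log entry per line, as in the original log file, and it accepts both "@" and the " at " that the list archive substitutes for it. The names ELECTOR_RE, election_events and stuck_epochs are made up for this illustration and are not part of Ceph.

```python
import re
from collections import Counter

# Matches elector entries such as
#   "... mon.ceph-mon-03@2(electing).elector(947) init, last seen epoch 947"
# The list archive renders "@" as " at ", so both spellings are accepted.
ELECTOR_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d+)\s+\S+\s+\d+\s+"
    r"mon\.(?P<name>\S+?)(?:@| at )(?P<rank>-?\d+)\((?P<state>\w+)\)"
    r"\.elector\((?P<epoch>\d+)\)\s+(?P<msg>.*)$"
)


def election_events(lines):
    """Yield (timestamp, epoch, message) for every elector log entry."""
    for line in lines:
        m = ELECTOR_RE.match(line.strip())
        if m:
            yield m["ts"], int(m["epoch"]), m["msg"]


def stuck_epochs(lines, threshold=3):
    """Return {epoch: count} for epochs where 'init' was logged >= threshold times."""
    starts = Counter(
        epoch for _, epoch, msg in election_events(lines) if msg.startswith("init")
    )
    return {epoch: n for epoch, n in starts.items() if n >= threshold}


# Entries taken from the mon03 log quoted further down in this thread.
SAMPLE = [
    "2014-09-17 16:49:15.533876 7f7ef3b1c700  1 mon.ceph-mon-03@2(electing).elector(933) init, last seen epoch 933",
    "2014-09-17 16:49:21.578269 7f7ef3b1c700  1 mon.ceph-mon-03@2(electing).elector(935) init, last seen epoch 935",
    "2014-09-17 16:49:26.578526 7f7ef3b1c700  1 mon.ceph-mon-03 at 2(electing).elector(935) init, last seen epoch 935",
    "2014-09-17 16:49:31.578790 7f7ef3b1c700  1 mon.ceph-mon-03 at 2(electing).elector(935) init, last seen epoch 935",
]

print(stuck_epochs(SAMPLE))
```

An epoch that shows up many times here means the mon keeps proposing an election at that epoch without the epoch advancing, which is the pattern in the logs above.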


J

On 17 September 2014 17:05, James Eckersall <james.eckersall at gmail.com>
wrote:

> Hi,
>
> Now I feel dumb for jumping to the conclusion that it was a simple
> networking issue - it isn't.
> I've just checked connectivity properly and I can ping and telnet 6789
> from all mon servers to all other mon servers.
>
> I've just restarted the mon03 service and the log is showing the following:
>
> 2014-09-17 16:49:02.355148 7f7ef9f8c800  0 starting mon.ceph-mon-03 rank 2
> at 10.1.1.66:6789/0 mon_data /var/lib/ceph/mon/ceph-ceph-mon-03 fsid
> 74069c87-b361-4bb8-8ce8-6ae9deb8a9bd
> 2014-09-17 16:49:02.355375 7f7ef9f8c800  1 mon.ceph-mon-03 at -1(probing) e2
> preinit fsid 74069c87-b361-4bb8-8ce8-6ae9deb8a9bd
> 2014-09-17 16:49:02.356347 7f7ef9f8c800  1 mon.ceph-mon-03 at -1(probing).paxosservice(pgmap
> 18241250..18241952) refresh upgraded, format 0 -> 1
> 2014-09-17 16:49:02.356360 7f7ef9f8c800  1 mon.ceph-mon-03 at -1(probing).pg
> v0 on_upgrade discarding in-core PGMap
> 2014-09-17 16:49:02.400316 7f7ef9f8c800  0 mon.ceph-mon-03 at -1(probing).mds
> e1 print_map
> epoch 1
> flags 0
> created 2013-12-09 10:19:58.534310
> modified 2013-12-09 10:19:58.534332
> tableserver 0
> root 0
> session_timeout 60
> session_autoclose 300
> max_file_size 1099511627776
> last_failure 0
> last_failure_osd_epoch 0
> compat compat={},rocompat={},incompat={1=base v0.20,2=client writeable
> ranges,3=default file layouts on dirs,4=dir inode in separate object,5=mds
> uses versioned encoding}
> max_mds 1
> in
> up {}
> failed
> stopped
> data_pools 0
> metadata_pool 1
> inline_data disabled
>
> 2014-09-17 16:49:02.402373 7f7ef9f8c800  0 mon.ceph-mon-03 at -1(probing).osd
> e49212 crush map has features 1107558400, adjusting msgr requires
> 2014-09-17 16:49:02.402384 7f7ef9f8c800  0 mon.ceph-mon-03 at -1(probing).osd
> e49212 crush map has features 1107558400, adjusting msgr requires
> 2014-09-17 16:49:02.402386 7f7ef9f8c800  0 mon.ceph-mon-03 at -1(probing).osd
> e49212 crush map has features 1107558400, adjusting msgr requires
> 2014-09-17 16:49:02.402388 7f7ef9f8c800  0 mon.ceph-mon-03 at -1(probing).osd
> e49212 crush map has features 1107558400, adjusting msgr requires
> 2014-09-17 16:49:02.403725 7f7ef9f8c800  1 mon.ceph-mon-03 at -1(probing).paxosservice(auth
> 26001..26154) refresh upgraded, format 0 -> 1
> 2014-09-17 16:49:02.404834 7f7ef9f8c800  0 mon.ceph-mon-03 at -1(probing) e2
>  my rank is now 2 (was -1)
> 2014-09-17 16:49:02.407439 7f7ef331b700  1 mon.ceph-mon-03 at 2(synchronizing)
> e2 sync_obtain_latest_monmap
> 2014-09-17 16:49:02.407588 7f7ef331b700  1 mon.ceph-mon-03 at 2(synchronizing)
> e2 sync_obtain_latest_monmap obtained monmap e2
> 2014-09-17 16:49:09.514365 7f7ef331b700  0 log [INF] : mon.ceph-mon-03
> calling new monitor election
> 2014-09-17 16:49:09.514523 7f7ef331b700  1 mon.ceph-mon-03 at 2(electing).elector(931)
> init, last seen epoch 931
> 2014-09-17 16:49:09.514658 7f7ef331b700  1 mon.ceph-mon-03 at 2(electing).paxos(paxos
> recovering c 31223899..31224482) is_readable now=2014-09-17 16:49:09.514659
> lease_expire=0.000000 has v0 lc 31224482
> 2014-09-17 16:49:09.514665 7f7ef331b700  1 mon.ceph-mon-03 at 2(electing).paxos(paxos
> recovering c 31223899..31224482) is_readable now=2014-09-17 16:49:09.514666
> lease_expire=0.000000 has v0 lc 31224482
> 2014-09-17 16:49:15.533876 7f7ef3b1c700  1 mon.ceph-mon-03 at 2(electing).elector(933)
> init, last seen epoch 933
> 2014-09-17 16:49:21.578269 7f7ef3b1c700  1 mon.ceph-mon-03 at 2(electing).elector(935)
> init, last seen epoch 935
> 2014-09-17 16:49:26.578526 7f7ef3b1c700  1 mon.ceph-mon-03 at 2(electing).elector(935)
> init, last seen epoch 935
> 2014-09-17 16:49:31.578790 7f7ef3b1c700  1 mon.ceph-mon-03 at 2(electing).elector(935)
> init, last seen epoch 935
> 2014-09-17 16:49:36.579044 7f7ef3b1c700  1 mon.ceph-mon-03 at 2(electing).elector(935)
> init, last seen epoch 935
>
>
> The last lines about "electing" repeat forever.  The other mons are
> logging far more entries than I have seen them log before.  They look like
> the following (note the timestamps - all of these log lines are from just a
> 2-second period):
>
> 2014-09-17 16:55:10.019407 7fd5a479a700  1 mon.ceph-mon-02 at 1(peon).paxos(paxos
> active c 31224401..31225038) is_readable now=2014-09-17 16:55:10.019408
> lease_expire=2014-09-17 16:55:14.518716 has v0 lc 31225038
> 2014-09-17 16:55:10.019418 7fd5a479a700  1 mon.ceph-mon-02 at 1(peon).paxos(paxos
> active c 31224401..31225038) is_readable now=2014-09-17 16:55:10.019418
> lease_expire=2014-09-17 16:55:14.518716 has v0 lc 31225038
> 2014-09-17 16:55:10.180220 7fd5a479a700  1 mon.ceph-mon-02 at 1(peon).paxos(paxos
> active c 31224401..31225038) is_readable now=2014-09-17 16:55:10.180222
> lease_expire=2014-09-17 16:55:14.518716 has v0 lc 31225038
> 2014-09-17 16:55:10.180233 7fd5a479a700  1 mon.ceph-mon-02 at 1(peon).paxos(paxos
> active c 31224401..31225038) is_readable now=2014-09-17 16:55:10.180234
> lease_expire=2014-09-17 16:55:14.518716 has v0 lc 31225038
> 2014-09-17 16:55:10.192668 7fd5a479a700  1 mon.ceph-mon-02 at 1(peon).paxos(paxos
> active c 31224401..31225038) is_readable now=2014-09-17 16:55:10.192670
> lease_expire=2014-09-17 16:55:14.518716 has v0 lc 31225038
> 2014-09-17 16:55:10.192691 7fd5a479a700  1 mon.ceph-mon-02 at 1(peon).paxos(paxos
> active c 31224401..31225038) is_readable now=2014-09-17 16:55:10.192692
> lease_expire=2014-09-17 16:55:14.518716 has v0 lc 31225038
> 2014-09-17 16:55:10.276726 7fd5a479a700  1 mon.ceph-mon-02 at 1(peon).paxos(paxos
> active c 31224401..31225038) is_readable now=2014-09-17 16:55:10.276727
> lease_expire=2014-09-17 16:55:14.518716 has v0 lc 31225038
> 2014-09-17 16:55:10.276737 7fd5a479a700  1 mon.ceph-mon-02 at 1(peon).paxos(paxos
> active c 31224401..31225038) is_readable now=2014-09-17 16:55:10.276737
> lease_expire=2014-09-17 16:55:14.518716 has v0 lc 31225038
> 2014-09-17 16:55:10.302638 7fd5a479a700  1 mon.ceph-mon-02 at 1(peon).paxos(paxos
> active c 31224401..31225038) is_readable now=2014-09-17 16:55:10.302640
> lease_expire=2014-09-17 16:55:14.518716 has v0 lc 31225038
> 2014-09-17 16:55:10.302651 7fd5a479a700  1 mon.ceph-mon-02 at 1(peon).paxos(paxos
> active c 31224401..31225038) is_readable now=2014-09-17 16:55:10.302652
> lease_expire=2014-09-17 16:55:14.518716 has v0 lc 31225038
> 2014-09-17 16:55:10.362642 7fd5a479a700  1 mon.ceph-mon-02 at 1(peon).paxos(paxos
> active c 31224401..31225038) is_readable now=2014-09-17 16:55:10.362643
> lease_expire=2014-09-17 16:55:14.518716 has v0 lc 31225038
> 2014-09-17 16:55:10.362655 7fd5a479a700  1 mon.ceph-mon-02 at 1(peon).paxos(paxos
> active c 31224401..31225038) is_readable now=2014-09-17 16:55:10.362656
> lease_expire=2014-09-17 16:55:14.518716 has v0 lc 31225038
> 2014-09-17 16:55:10.385686 7fd5a479a700  1 mon.ceph-mon-02 at 1(peon).paxos(paxos
> active c 31224401..31225038) is_readable now=2014-09-17 16:55:10.385687
> lease_expire=2014-09-17 16:55:14.518716 has v0 lc 31225038
> 2014-09-17 16:55:10.385697 7fd5a479a700  1 mon.ceph-mon-02 at 1(peon).paxos(paxos
> active c 31224401..31225038) is_readable now=2014-09-17 16:55:10.385697
> lease_expire=2014-09-17 16:55:14.518716 has v0 lc 31225038
> 2014-09-17 16:55:10.406712 7fd5a479a700  1 mon.ceph-mon-02 at 1(peon).paxos(paxos
> active c 31224401..31225038) is_readable now=2014-09-17 16:55:10.406713
> lease_expire=2014-09-17 16:55:14.518716 has v0 lc 31225038
> 2014-09-17 16:55:10.406723 7fd5a479a700  1 mon.ceph-mon-02 at 1(peon).paxos(paxos
> active c 31224401..31225038) is_readable now=2014-09-17 16:55:10.406724
> lease_expire=2014-09-17 16:55:14.518716 has v0 lc 31225038
> 2014-09-17 16:55:10.423277 7fd5a479a700  1 mon.ceph-mon-02 at 1(peon).paxos(paxos
> active c 31224401..31225038) is_readable now=2014-09-17 16:55:10.423279
> lease_expire=2014-09-17 16:55:14.518716 has v0 lc 31225038
> 2014-09-17 16:55:10.423299 7fd5a479a700  1 mon.ceph-mon-02 at 1(peon).paxos(paxos
> active c 31224401..31225038) is_readable now=2014-09-17 16:55:10.423300
> lease_expire=2014-09-17 16:55:14.518716 has v0 lc 31225038
> 2014-09-17 16:55:10.543138 7fd5a479a700  1 mon.ceph-mon-02 at 1(peon).paxos(paxos
> updating c 31224401..31225038) is_readable now=2014-09-17 16:55:10.543139
> lease_expire=0.000000 has v0 lc 31225038
> 2014-09-17 16:55:10.543145 7fd5a479a700  1 mon.ceph-mon-02 at 1(peon).paxos(paxos
> updating c 31224401..31225038) is_readable now=2014-09-17 16:55:10.543145
> lease_expire=0.000000 has v0 lc 31225038
> 2014-09-17 16:55:10.580911 7fd5a479a700  1 mon.ceph-mon-02 at 1(peon).paxos(paxos
> active c 31224401..31225039) is_readable now=2014-09-17 16:55:10.580912
> lease_expire=2014-09-17 16:55:15.549947 has v0 lc 31225039
> 2014-09-17 16:55:10.580922 7fd5a479a700  1 mon.ceph-mon-02 at 1(peon).paxos(paxos
> active c 31224401..31225039) is_readable now=2014-09-17 16:55:10.580923
> lease_expire=2014-09-17 16:55:15.549947 has v0 lc 31225039
> 2014-09-17 16:55:10.580930 7fd5a479a700  1 mon.ceph-mon-02 at 1(peon).paxos(paxos
> active c 31224401..31225039) is_readable now=2014-09-17 16:55:10.580930
> lease_expire=2014-09-17 16:55:15.549947 has v0 lc 31225039
> 2014-09-17 16:55:10.606130 7fd5a479a700  1 mon.ceph-mon-02 at 1(peon).paxos(paxos
> updating c 31224401..31225039) is_readable now=2014-09-17 16:55:10.606131
> lease_expire=0.000000 has v0 lc 31225039
> 2014-09-17 16:55:10.606136 7fd5a479a700  1 mon.ceph-mon-02 at 1(peon).paxos(paxos
> updating c 31224401..31225039) is_readable now=2014-09-17 16:55:10.606137
> lease_expire=0.000000 has v0 lc 31225039
> 2014-09-17 16:55:10.633460 7fd5a479a700  1 mon.ceph-mon-02 at 1(peon).log
> v12645471 check_sub sending message to client.2190462 10.1.1.10:0/1004032
> with 1 entries (version 12645471)
> 2014-09-17 16:55:10.633632 7fd5a479a700  1 mon.ceph-mon-02 at 1(peon).paxos(paxos
> active c 31224401..31225040) is_readable now=2014-09-17 16:55:10.633633
> lease_expire=2014-09-17 16:55:15.607320 has v0 lc 31225040
> 2014-09-17 16:55:10.633646 7fd5a479a700  1 mon.ceph-mon-02 at 1(peon).paxos(paxos
> active c 31224401..31225040) is_readable now=2014-09-17 16:55:10.633651
> lease_expire=2014-09-17 16:55:15.607320 has v0 lc 31225040
> 2014-09-17 16:55:10.633657 7fd5a479a700  1 mon.ceph-mon-02 at 1(peon).paxos(paxos
> active c 31224401..31225040) is_readable now=2014-09-17 16:55:10.633658
> lease_expire=2014-09-17 16:55:15.607320 has v0 lc 31225040
> 2014-09-17 16:55:10.633699 7fd5a479a700  1 mon.ceph-mon-02 at 1(peon).paxos(paxos
> active c 31224401..31225040) is_readable now=2014-09-17 16:55:10.633700
> lease_expire=2014-09-17 16:55:15.607320 has v0 lc 31225040
> 2014-09-17 16:55:10.633707 7fd5a479a700  1 mon.ceph-mon-02 at 1(peon).paxos(paxos
> active c 31224401..31225040) is_readable now=2014-09-17 16:55:10.633707
> lease_expire=2014-09-17 16:55:15.607320 has v0 lc 31225040
> 2014-09-17 16:55:10.695127 7fd5a479a700  1 mon.ceph-mon-02 at 1(peon).paxos(paxos
> active c 31224401..31225040) is_readable now=2014-09-17 16:55:10.695129
> lease_expire=2014-09-17 16:55:15.607320 has v0 lc 31225040
> 2014-09-17 16:55:10.695151 7fd5a479a700  1 mon.ceph-mon-02 at 1(peon).paxos(paxos
> active c 31224401..31225040) is_readable now=2014-09-17 16:55:10.695152
> lease_expire=2014-09-17 16:55:15.607320 has v0 lc 31225040
> 2014-09-17 16:55:10.800013 7fd5a479a700  1 mon.ceph-mon-02 at 1(peon).paxos(paxos
> active c 31224401..31225040) is_readable now=2014-09-17 16:55:10.800015
> lease_expire=2014-09-17 16:55:15.607320 has v0 lc 31225040
> 2014-09-17 16:55:10.800030 7fd5a479a700  1 mon.ceph-mon-02 at 1(peon).paxos(paxos
> active c 31224401..31225040) is_readable now=2014-09-17 16:55:10.800031
> lease_expire=2014-09-17 16:55:15.607320 has v0 lc 31225040
> 2014-09-17 16:55:10.830432 7fd5a479a700  1 mon.ceph-mon-02 at 1(peon).paxos(paxos
> active c 31224401..31225040) is_readable now=2014-09-17 16:55:10.830433
> lease_expire=2014-09-17 16:55:15.607320 has v0 lc 31225040
> 2014-09-17 16:55:10.830441 7fd5a479a700  1 mon.ceph-mon-02 at 1(peon).paxos(paxos
> active c 31224401..31225040) is_readable now=2014-09-17 16:55:10.830442
> lease_expire=2014-09-17 16:55:15.607320 has v0 lc 31225040
> 2014-09-17 16:55:10.848954 7fd5a479a700  1 mon.ceph-mon-02 at 1(peon).paxos(paxos
> active c 31224401..31225040) is_readable now=2014-09-17 16:55:10.848956
> lease_expire=2014-09-17 16:55:15.607320 has v0 lc 31225040
> 2014-09-17 16:55:10.848964 7fd5a479a700  1 mon.ceph-mon-02 at 1(peon).paxos(paxos
> active c 31224401..31225040) is_readable now=2014-09-17 16:55:10.848965
> lease_expire=2014-09-17 16:55:15.607320 has v0 lc 31225040
> 2014-09-17 16:55:10.887139 7fd5a479a700  1 mon.ceph-mon-02 at 1(peon).paxos(paxos
> active c 31224401..31225040) is_readable now=2014-09-17 16:55:10.887140
> lease_expire=2014-09-17 16:55:15.607320 has v0 lc 31225040
> 2014-09-17 16:55:10.887150 7fd5a479a700  1 mon.ceph-mon-02 at 1(peon).paxos(paxos
> active c 31224401..31225040) is_readable now=2014-09-17 16:55:10.887151
> lease_expire=2014-09-17 16:55:15.607320 has v0 lc 31225040
> 2014-09-17 16:55:10.913825 7fd5a479a700  1 mon.ceph-mon-02 at 1(peon).paxos(paxos
> active c 31224401..31225040) is_readable now=2014-09-17 16:55:10.913827
> lease_expire=2014-09-17 16:55:15.607320 has v0 lc 31225040
> 2014-09-17 16:55:10.913834 7fd5a479a700  1 mon.ceph-mon-02 at 1(peon).paxos(paxos
> active c 31224401..31225040) is_readable now=2014-09-17 16:55:10.913835
> lease_expire=2014-09-17 16:55:15.607320 has v0 lc 31225040
> 2014-09-17 16:55:11.010277 7fd5a479a700  1 mon.ceph-mon-02 at 1(peon).paxos(paxos
> active c 31224401..31225040) is_readable now=2014-09-17 16:55:11.010279
> lease_expire=2014-09-17 16:55:15.607320 has v0 lc 31225040
> 2014-09-17 16:55:11.010287 7fd5a479a700  1 mon.ceph-mon-02 at 1(peon).paxos(paxos
> active c 31224401..31225040) is_readable now=2014-09-17 16:55:11.010288
> lease_expire=2014-09-17 16:55:15.607320 has v0 lc 31225040
> 2014-09-17 16:55:11.098312 7fd5a479a700  1 mon.ceph-mon-02 at 1(peon).paxos(paxos
> active c 31224401..31225040) is_readable now=2014-09-17 16:55:11.098314
> lease_expire=2014-09-17 16:55:15.607320 has v0 lc 31225040
> 2014-09-17 16:55:11.098325 7fd5a479a700  1 mon.ceph-mon-02 at 1(peon).paxos(paxos
> active c 31224401..31225040) is_readable now=2014-09-17 16:55:11.098326
> lease_expire=2014-09-17 16:55:15.607320 has v0 lc 31225040
> 2014-09-17 16:55:11.109040 7fd5a479a700  1 mon.ceph-mon-02 at 1(peon).paxos(paxos
> active c 31224401..31225040) is_readable now=2014-09-17 16:55:11.109042
> lease_expire=2014-09-17 16:55:15.607320 has v0 lc 31225040
> 2014-09-17 16:55:11.109053 7fd5a479a700  1 mon.ceph-mon-02 at 1(peon).paxos(paxos
> active c 31224401..31225040) is_readable now=2014-09-17 16:55:11.109054
> lease_expire=2014-09-17 16:55:15.607320 has v0 lc 31225040
> 2014-09-17 16:55:11.170705 7fd5a479a700  1 mon.ceph-mon-02 at 1(peon).paxos(paxos
> active c 31224401..31225040) is_readable now=2014-09-17 16:55:11.170706
> lease_expire=2014-09-17 16:55:15.607320 has v0 lc 31225040
> 2014-09-17 16:55:11.170713 7fd5a479a700  1 mon.ceph-mon-02 at 1(peon).paxos(paxos
> active c 31224401..31225040) is_readable now=2014-09-17 16:55:11.170714
> lease_expire=2014-09-17 16:55:15.607320 has v0 lc 31225040
> 2014-09-17 16:55:11.222537 7fd5a479a700  1 mon.ceph-mon-02 at 1(peon).paxos(paxos
> active c 31224401..31225040) is_readable now=2014-09-17 16:55:11.222539
> lease_expire=2014-09-17 16:55:15.607320 has v0 lc 31225040
> 2014-09-17 16:55:11.222549 7fd5a479a700  1 mon.ceph-mon-02 at 1(peon).paxos(paxos
> active c 31224401..31225040) is_readable now=2014-09-17 16:55:11.222550
> lease_expire=2014-09-17 16:55:15.607320 has v0 lc 31225040
> 2014-09-17 16:55:11.431510 7fd5a479a700  1 mon.ceph-mon-02 at 1(peon).paxos(paxos
> active c 31224401..31225040) is_readable now=2014-09-17 16:55:11.431511
> lease_expire=2014-09-17 16:55:15.607320 has v0 lc 31225040
> 2014-09-17 16:55:11.431524 7fd5a479a700  1 mon.ceph-mon-02 at 1(peon).paxos(paxos
> active c 31224401..31225040) is_readable now=2014-09-17 16:55:11.431525
> lease_expire=2014-09-17 16:55:15.607320 has v0 lc 31225040
> 2014-09-17 16:55:11.453664 7fd5a479a700  1 mon.ceph-mon-02 at 1(peon).paxos(paxos
> active c 31224401..31225040) is_readable now=2014-09-17 16:55:11.453666
> lease_expire=2014-09-17 16:55:15.607320 has v0 lc 31225040
> 2014-09-17 16:55:11.453685 7fd5a479a700  1 mon.ceph-mon-02 at 1(peon).paxos(paxos
> active c 31224401..31225040) is_readable now=2014-09-17 16:55:11.453687
> lease_expire=2014-09-17 16:55:15.607320 has v0 lc 31225040
> 2014-09-17 16:55:11.520250 7fd5a479a700  1 mon.ceph-mon-02 at 1(peon).paxos(paxos
> active c 31224401..31225040) is_readable now=2014-09-17 16:55:11.520252
> lease_expire=2014-09-17 16:55:15.607320 has v0 lc 31225040
> 2014-09-17 16:55:11.520263 7fd5a479a700  1 mon.ceph-mon-02 at 1(peon).paxos(paxos
> active c 31224401..31225040) is_readable now=2014-09-17 16:55:11.520264
> lease_expire=2014-09-17 16:55:15.607320 has v0 lc 31225040
> 2014-09-17 16:55:11.603991 7fd5a479a700  1 mon.ceph-mon-02 at 1(peon).paxos(paxos
> active c 31224401..31225041) is_readable now=2014-09-17 16:55:11.603992
> lease_expire=2014-09-17 16:55:16.575181 has v0 lc 31225041
> 2014-09-17 16:55:11.610948 7fd5a479a700  1 mon.ceph-mon-02 at 1(peon).paxos(paxos
> active c 31224401..31225041) is_readable now=2014-09-17 16:55:11.610949
> lease_expire=2014-09-17 16:55:16.575181 has v0 lc 31225041
> 2014-09-17 16:55:11.610965 7fd5a479a700  1 mon.ceph-mon-02 at 1(peon).paxos(paxos
> active c 31224401..31225041) is_readable now=2014-09-17 16:55:11.610966
> lease_expire=2014-09-17 16:55:16.575181 has v0 lc 31225041
> 2014-09-17 16:55:11.622479 7fd5a479a700  1 mon.ceph-mon-02 at 1(peon).paxos(paxos
> active c 31224401..31225041) is_readable now=2014-09-17 16:55:11.622480
> lease_expire=2014-09-17 16:55:16.575181 has v0 lc 31225041
> 2014-09-17 16:55:11.622495 7fd5a479a700  1 mon.ceph-mon-02 at 1(peon).paxos(paxos
> active c 31224401..31225041) is_readable now=2014-09-17 16:55:11.622496
> lease_expire=2014-09-17 16:55:16.575181 has v0 lc 31225041
> 2014-09-17 16:55:11.787013 7fd5a479a700  1 mon.ceph-mon-02 at 1(peon).paxos(paxos
> active c 31224401..31225041) is_readable now=2014-09-17 16:55:11.787014
> lease_expire=2014-09-17 16:55:16.575181 has v0 lc 31225041
> 2014-09-17 16:55:11.787024 7fd5a479a700  1 mon.ceph-mon-02 at 1(peon).paxos(paxos
> active c 31224401..31225041) is_readable now=2014-09-17 16:55:11.787025
> lease_expire=2014-09-17 16:55:16.575181 has v0 lc 31225041
> 2014-09-17 16:55:11.873613 7fd5a479a700  1 mon.ceph-mon-02 at 1(peon).paxos(paxos
> active c 31224401..31225041) is_readable now=2014-09-17 16:55:11.873614
> lease_expire=2014-09-17 16:55:16.575181 has v0 lc 31225041
> 2014-09-17 16:55:11.873627 7fd5a479a700  1 mon.ceph-mon-02 at 1(peon).paxos(paxos
> active c 31224401..31225041) is_readable now=2014-09-17 16:55:11.873628
> lease_expire=2014-09-17 16:55:16.575181 has v0 lc 31225041
> 2014-09-17 16:55:11.988465 7fd5a479a700  1 mon.ceph-mon-02 at 1(peon).paxos(paxos
> active c 31224401..31225041) is_readable now=2014-09-17 16:55:11.988467
> lease_expire=2014-09-17 16:55:16.575181 has v0 lc 31225041
> 2014-09-17 16:55:11.988487 7fd5a479a700  1 mon.ceph-mon-02 at 1(peon).paxos(paxos
> active c 31224401..31225041) is_readable now=2014-09-17 16:55:11.988489
> lease_expire=2014-09-17 16:55:16.575181 has v0 lc 31225041
>
>
> I'm wondering at this point whether I should just reinject the monmap from
> mon01 or mon02 into mon03 or whether there is something else that can be
> done to fix this.
>
> With hindsight, I would have stopped the mon service before relocating the
> nic cable, but I expected the mon to survive a short network outage, which
> it doesn't seem to have done :(
>
>
> On 17 September 2014 16:21, James Eckersall <james.eckersall at gmail.com>
> wrote:
>
>> Hi,
>>
>> Thanks for the advice.
>>
>> I feel pretty dumb as it does indeed look like a simple networking issue.
>>  You know how you check things 5 times and miss the most obvious one...
>>
>> J
>>
>> On 17 September 2014 16:04, Florian Haas <florian at hastexo.com> wrote:
>>
>>> On Wed, Sep 17, 2014 at 1:58 PM, James Eckersall
>>> <james.eckersall at gmail.com> wrote:
>>> > Hi,
>>> >
>>> > I have a ceph cluster running 0.80.1 on Ubuntu 14.04.  I have 3 monitors and
>>> > 4 OSD nodes currently.
>>> >
>>> > Everything has been running great up until today where I've got an issue
>>> > with the monitors.
>>> > I moved mon03 to a different switchport so it would have temporarily lost
>>> > connectivity.
>>> > Since then, the cluster is reporting that that mon is down, although it's
>>> > definitely up.
>>> > I've tried restarting the mon services on all three mons, but that hasn't
>>> > made a difference.
>>> > I definitely, 100% do not have any clock skew on any of the mons.  This has
>>> > been triple-checked as the ceph docs seem to suggest that might be the cause
>>> > of this issue.
>>> >
>>> > Here is what ceph -s and ceph health detail are reporting as well as the
>>> > mon_status for each monitor:
>>> >
>>> >
>>> > # ceph -s ; ceph health detail
>>> >     cluster XXX
>>> >      health HEALTH_WARN 1 mons down, quorum 0,1 ceph-mon-01,ceph-mon-02
>>> >      monmap e2: 3 mons at {ceph-mon-01=10.1.1.64:6789/0,ceph-mon-02=10.1.1.65:6789/0,ceph-mon-03=10.1.1.66:6789/0}, election epoch 932, quorum 0,1 ceph-mon-01,ceph-mon-02
>>> >      osdmap e49213: 80 osds: 80 up, 80 in
>>> >       pgmap v18242952: 4864 pgs, 5 pools, 69910 GB data, 17638 kobjects
>>> >             197 TB used, 95904 GB / 290 TB avail
>>> >                    8 active+clean+scrubbing+deep
>>> >                 4856 active+clean
>>> >   client io 6893 kB/s rd, 5657 kB/s wr, 2090 op/s
>>> > HEALTH_WARN 1 mons down, quorum 0,1 ceph-mon-01,ceph-mon-02
>>> > mon.ceph-mon-03 (rank 2) addr 10.1.1.66:6789/0 is down (out of quorum)
>>> >
>>> >
>>> > { "name": "ceph-mon-01",
>>> >   "rank": 0,
>>> >   "state": "leader",
>>> >   "election_epoch": 932,
>>> >   "quorum": [
>>> >         0,
>>> >         1],
>>> >   "outside_quorum": [],
>>> >   "extra_probe_peers": [],
>>> >   "sync_provider": [],
>>> >   "monmap": { "epoch": 2,
>>> >       "fsid": "XXX",
>>> >       "modified": "0.000000",
>>> >       "created": "0.000000",
>>> >       "mons": [
>>> >             { "rank": 0,
>>> >               "name": "ceph-mon-01",
>>> >               "addr": "10.1.1.64:6789\/0"},
>>> >             { "rank": 1,
>>> >               "name": "ceph-mon-02",
>>> >               "addr": "10.1.1.65:6789\/0"},
>>> >             { "rank": 2,
>>> >               "name": "ceph-mon-03",
>>> >               "addr": "10.1.1.66:6789\/0"}]}}
>>> >
>>> >
>>> > { "name": "ceph-mon-02",
>>> >   "rank": 1,
>>> >   "state": "peon",
>>> >   "election_epoch": 932,
>>> >   "quorum": [
>>> >         0,
>>> >         1],
>>> >   "outside_quorum": [],
>>> >   "extra_probe_peers": [],
>>> >   "sync_provider": [],
>>> >   "monmap": { "epoch": 2,
>>> >       "fsid": "XXX",
>>> >       "modified": "0.000000",
>>> >       "created": "0.000000",
>>> >       "mons": [
>>> >             { "rank": 0,
>>> >               "name": "ceph-mon-01",
>>> >               "addr": "10.1.1.64:6789\/0"},
>>> >             { "rank": 1,
>>> >               "name": "ceph-mon-02",
>>> >               "addr": "10.1.1.65:6789\/0"},
>>> >             { "rank": 2,
>>> >               "name": "ceph-mon-03",
>>> >               "addr": "10.1.1.66:6789\/0"}]}}
>>> >
>>> >
>>> > { "name": "ceph-mon-03",
>>> >   "rank": 2,
>>> >   "state": "electing",
>>> >   "election_epoch": 931,
>>> >   "quorum": [],
>>> >   "outside_quorum": [],
>>> >   "extra_probe_peers": [],
>>> >   "sync_provider": [],
>>> >   "monmap": { "epoch": 2,
>>> >       "fsid": "XXX",
>>> >       "modified": "0.000000",
>>> >       "created": "0.000000",
>>> >       "mons": [
>>> >             { "rank": 0,
>>> >               "name": "ceph-mon-01",
>>> >               "addr": "10.1.1.64:6789\/0"},
>>> >             { "rank": 1,
>>> >               "name": "ceph-mon-02",
>>> >               "addr": "10.1.1.65:6789\/0"},
>>> >             { "rank": 2,
>>> >               "name": "ceph-mon-03",
>>> >               "addr": "10.1.1.66:6789\/0"}]}}
>>> >
>>> >
>>> > Any help or advice is appreciated.
>>>
>>> It looks like your mon has been unable to communicate with the other
>>> hosts, presumably since the time you un-/replugged it. Check your
>>> switch port configuration. Also, make sure that from 10.1.1.66, you
>>> can not only ping 10.1.1.64 and 10.1.1.65, but make a TCP connection
>>> on port 6789. With that out of the way, check your mon log on
>>> ceph-mon-03 (in /var/log/ceph/mon); it should provide some additional
>>> insight into the problem.
>>>
>>> Cheers,
>>> Florian
>>>
>>
>>
>
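
The connectivity check discussed in this thread (a TCP connection to port 6789 between all mons, not just ping) could be scripted roughly as below. This is only a sketch: the addresses are the ones from the monmap quoted above, and the names MONITORS, can_connect and check_monitors are invented for this example.

```python
import socket

# Monitor addresses as listed in the monmap quoted in this thread.
MONITORS = {
    "ceph-mon-01": ("10.1.1.64", 6789),
    "ceph-mon-02": ("10.1.1.65", 6789),
    "ceph-mon-03": ("10.1.1.66", 6789),
}


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable hosts.
        return False


def check_monitors(monitors, probe=can_connect):
    """Return {monitor name: reachable?} using `probe` to test each address."""
    return {name: probe(host, port) for name, (host, port) in monitors.items()}
```

Calling check_monitors(MONITORS) on each mon host in turn covers every direction. As the follow-up messages show, all of these connections succeeding only rules out the network; it does not mean the mon will rejoin quorum.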