Re: Fwd: Mons stuck in election after 3 days offline


On 07/26/2018 11:50 AM, Benjamin Naber wrote:
> Hi Wido,
> 
> I got the following output after changing the debug setting:
> 

This seems to be only debug_ms output?

debug_mon = 10
debug_ms = 10

Both of these should be set; debug_mon will tell you more about the
election process.
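
One quick way to confirm which subsystems actually show up in a log
excerpt is to count lines by their prefix. The sketch below is only an
illustration and not part of Ceph: classify and LINE_RE are hypothetical
names, and it assumes that messenger lines start with "-- " or
"Processor -- " (as in the excerpt quoted below) and that monitor lines
start with "mon.".

```python
import re

# Assumed line layout: "<date> <time> <thread-id> <level> <message>".
LINE_RE = re.compile(
    r"^\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d+ [0-9a-f]+ +(-?\d+) (.*)$"
)


def classify(lines):
    """Count log lines per assumed subsystem: 'ms', 'mon' or 'other'."""
    counts = {"ms": 0, "mon": 0, "other": 0}
    for line in lines:
        match = LINE_RE.match(line)
        if not match:
            continue  # wrapped continuation line or not a log line
        message = match.group(2)
        if message.startswith("-- ") or message.startswith("Processor -- "):
            counts["ms"] += 1  # messenger output (debug_ms)
        elif message.startswith("mon."):
            counts["mon"] += 1  # assumed monitor output (debug_mon)
        else:
            counts["other"] += 1
    return counts


# Two lines taken from the quoted excerpt, with the mail wrapping undone.
sample = [
    "2018-07-26 11:46:23.058057 7f81a4173700 1 -- 10.111.73.1:6789/0 >> "
    "10.111.73.2:0/3994280291 conn(0x55aa46c46000 :6789 s=STATE_OPEN "
    "pgs=77 cs=1 l=1).mark_down",
    "2018-07-26 11:46:23.962796 7f819d966700 10 Processor -- accept "
    "listen_fd=22",
]

print(classify(sample))  # {'ms': 2, 'mon': 0, 'other': 0}
```

A "mon" count of zero over a longer excerpt would suggest that
debug_mon has not taken effect yet.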

Wido

> 2018-07-26 11:46:21.004490 7f819e968700 10 -- 10.111.73.1:6789/0 >>
> 10.111.73.3:0/1033315403 conn(0x55aa46c4a800 :6789 s=STATE_OPEN pgs=71
> cs=1 l=1)._try_send sent bytes 9 remaining bytes 0
> 2018-07-26 11:46:21.004520 7f81a196e700 10 -- 10.111.73.1:6789/0
> dispatch_throttle_release 60 to dispatch throttler 60/104857600
> 2018-07-26 11:46:23.058057 7f81a4173700 1 -- 10.111.73.1:6789/0 >>
> 10.111.73.2:0/3994280291 conn(0x55aa46c46000 :6789 s=STATE_OPEN pgs=77
> cs=1 l=1).mark_down
> 2018-07-26 11:46:23.058084 7f81a4173700 2 -- 10.111.73.1:6789/0 >>
> 10.111.73.2:0/3994280291 conn(0x55aa46c46000 :6789 s=STATE_OPEN pgs=77
> cs=1 l=1)._stop
> 2018-07-26 11:46:23.058094 7f81a4173700 10 -- 10.111.73.1:6789/0 >>
> 10.111.73.2:0/3994280291 conn(0x55aa46c46000 :6789 s=STATE_OPEN pgs=77
> cs=1 l=1).discard_out_queue started
> 2018-07-26 11:46:23.058120 7f81a4173700 1 -- 10.111.73.1:6789/0 >>
> 10.111.73.3:0/1033315403 conn(0x55aa46c4a800 :6789 s=STATE_OPEN pgs=71
> cs=1 l=1).mark_down
> 2018-07-26 11:46:23.058131 7f81a4173700 2 -- 10.111.73.1:6789/0 >>
> 10.111.73.3:0/1033315403 conn(0x55aa46c4a800 :6789 s=STATE_OPEN pgs=71
> cs=1 l=1)._stop
> 2018-07-26 11:46:23.058143 7f81a4173700 10 -- 10.111.73.1:6789/0 >>
> 10.111.73.3:0/1033315403 conn(0x55aa46c4a800 :6789 s=STATE_OPEN pgs=71
> cs=1 l=1).discard_out_queue started
> 2018-07-26 11:46:23.962796 7f819d966700 10 Processor -- accept listen_fd=22
> 2018-07-26 11:46:23.962845 7f819d966700 10 Processor -- accept accepted
> incoming on sd 23
> 2018-07-26 11:46:23.962858 7f819d966700 10 -- 10.111.73.1:6789/0 >> -
> conn(0x55aa46afd800 :-1 s=STATE_NONE pgs=0 cs=0 l=0).accept sd=23
> 2018-07-26 11:46:23.962929 7f819e167700 1 -- 10.111.73.1:6789/0 >> -
> conn(0x55aa46afd800 :6789 s=STATE_ACCEPTING pgs=0 cs=0
> l=0)._process_connection sd=23 -
> 2018-07-26 11:46:23.963022 7f819e167700 10 -- 10.111.73.1:6789/0 >> -
> conn(0x55aa46afd800 :6789 s=STATE_ACCEPTING pgs=0 cs=0 l=0)._try_send
> sent bytes 281 remaining bytes 0
> 2018-07-26 11:46:23.963045 7f819e167700 10 -- 10.111.73.1:6789/0 >> -
> conn(0x55aa46afd800 :6789 s=STATE_ACCEPTING_WAIT_BANNER_ADDR pgs=0 cs=0
> l=0)._process_connection write banner and addr done: -
> 2018-07-26 11:46:23.963091 7f819e167700 10 -- 10.111.73.1:6789/0 >> -
> conn(0x55aa46afd800 :6789 s=STATE_ACCEPTING_WAIT_BANNER_ADDR pgs=0 cs=0
> l=0)._process_connection accept peer addr is 10.111.73.1:0/1745436331
> 2018-07-26 11:46:23.963190 7f819e167700 10 -- 10.111.73.1:6789/0 >>
> 10.111.73.1:0/1745436331 conn(0x55aa46afd800 :6789
> s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0
> l=1)._process_connection accept of host_type 8, policy.lossy=1
> policy.server=1 policy.standby=0 policy.resetcheck=0
> 2018-07-26 11:46:23.963216 7f819e167700 10 -- 10.111.73.1:6789/0 >>
> 10.111.73.1:0/1745436331 conn(0x55aa46afd800 :6789
> s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0
> l=1).handle_connect_msg accept my proto 15, their proto 15
> 2018-07-26 11:46:23.963232 7f819e167700 10 -- 10.111.73.1:6789/0 >>
> 10.111.73.1:0/1745436331 conn(0x55aa46afd800 :6789
> s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0
> l=1).handle_connect_msg accept setting up session_security.
> 2018-07-26 11:46:23.963248 7f819e167700 10 -- 10.111.73.1:6789/0 >>
> 10.111.73.1:0/1745436331 conn(0x55aa46afd800 :6789
> s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0
> l=1).handle_connect_msg accept new session
> 2018-07-26 11:46:23.963256 7f819e167700 10 -- 10.111.73.1:6789/0 >>
> 10.111.73.1:0/1745436331 conn(0x55aa46afd800 :6789
> s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=87 cs=1
> l=1).handle_connect_msg accept success, connect_seq = 1 in_seq=0,
> sending READY
> 2018-07-26 11:46:23.963264 7f819e167700 10 -- 10.111.73.1:6789/0 >>
> 10.111.73.1:0/1745436331 conn(0x55aa46afd800 :6789
> s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=87 cs=1
> l=1).handle_connect_msg accept features 4611087853745930235
> 2018-07-26 11:46:23.963315 7f819e167700 10 -- 10.111.73.1:6789/0 >>
> 10.111.73.1:0/1745436331 conn(0x55aa46afd800 :6789
> s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=87 cs=1 l=1)._try_send sent
> bytes 34 remaining bytes 0
> 2018-07-26 11:46:23.963356 7f819e167700 2 -- 10.111.73.1:6789/0 >>
> 10.111.73.1:0/1745436331 conn(0x55aa46afd800 :6789
> s=STATE_ACCEPTING_WAIT_SEQ pgs=87 cs=1 l=1).handle_connect_msg accept
> write reply msg done
> 2018-07-26 11:46:23.963442 7f819e167700 2 -- 10.111.73.1:6789/0 >>
> 10.111.73.1:0/1745436331 conn(0x55aa46afd800 :6789
> s=STATE_ACCEPTING_WAIT_SEQ pgs=87 cs=1 l=1)._process_connection accept
> get newly_acked_seq 0
> 2018-07-26 11:46:23.963461 7f819e167700 10 -- 10.111.73.1:6789/0 >>
> 10.111.73.1:0/1745436331 conn(0x55aa46afd800 :6789
> s=STATE_ACCEPTING_WAIT_SEQ pgs=87 cs=1 l=1).discard_requeued_up_to 0
> 2018-07-26 11:46:23.963634 7f819e167700 10 -- 10.111.73.1:6789/0 >>
> 10.111.73.1:0/1745436331 conn(0x55aa46afd800 :6789
> s=STATE_OPEN_KEEPALIVE2 pgs=87 cs=1 l=1)._append_keepalive_or_ack
> 2018-07-26 11:46:23.963658 7f819e167700 10 -- 10.111.73.1:6789/0 >>
> 10.111.73.1:0/1745436331 conn(0x55aa46afd800 :6789
> s=STATE_OPEN_MESSAGE_THROTTLE_BYTES pgs=87 cs=1 l=1).process wants 60
> bytes from policy throttler 120/104857600
> 2018-07-26 11:46:23.963679 7f819e167700 10 -- 10.111.73.1:6789/0 >>
> 10.111.73.1:0/1745436331 conn(0x55aa46afd800 :6789
> s=STATE_OPEN_MESSAGE_READ_FOOTER_AND_DISPATCH pgs=87 cs=1 l=1).process
> aborted = 0
> 2018-07-26 11:46:23.963705 7f819e167700 5 -- 10.111.73.1:6789/0 >>
> 10.111.73.1:0/1745436331 conn(0x55aa46afd800 :6789
> s=STATE_OPEN_MESSAGE_READ_FOOTER_AND_DISPATCH pgs=87 cs=1 l=1). rx
> client.? seq 1 0x55aa46be4480 auth(proto 0 30 bytes epoch 0) v1
> 2018-07-26 11:46:23.963750 7f819e167700 10 -- 10.111.73.1:6789/0 >>
> 10.111.73.1:0/1745436331 conn(0x55aa46afd800 :6789 s=STATE_OPEN pgs=87
> cs=1 l=1).handle_write
> 2018-07-26 11:46:23.963755 7f81a196e700 1 -- 10.111.73.1:6789/0 <==
> client.? 10.111.73.1:0/1745436331 1 ==== auth(proto 0 30 bytes epoch 0)
> v1 ==== 60+0+0 (4135352935 0 0) 0x55aa46be4480 con 0x55aa46afd800
> 2018-07-26 11:46:23.963808 7f819e167700 10 -- 10.111.73.1:6789/0 >>
> 10.111.73.1:0/1745436331 conn(0x55aa46afd800 :6789 s=STATE_OPEN pgs=87
> cs=1 l=1)._try_send sent bytes 9 remaining bytes 0
> 2018-07-26 11:46:23.963823 7f81a196e700 10 -- 10.111.73.1:6789/0
> dispatch_throttle_release 60 to dispatch throttler 60/104857600
> 2018-07-26 11:46:24.003866 7f819d966700 10 Processor -- accept listen_fd=22
> 2018-07-26 11:46:24.003902 7f819d966700 10 Processor -- accept accepted
> incoming on sd 26
> 2018-07-26 11:46:24.003911 7f819d966700 10 -- 10.111.73.1:6789/0 >> -
> conn(0x55aa46bc1000 :-1 s=STATE_NONE pgs=0 cs=0 l=0).accept sd=26
> 2018-07-26 11:46:24.004001 7f819e167700 1 -- 10.111.73.1:6789/0 >> -
> conn(0x55aa46bc1000 :6789 s=STATE_ACCEPTING pgs=0 cs=0
> l=0)._process_connection sd=26 -
> 2018-07-26 11:46:24.004057 7f819e167700 10 -- 10.111.73.1:6789/0 >> -
> conn(0x55aa46bc1000 :6789 s=STATE_ACCEPTING pgs=0 cs=0 l=0)._try_send
> sent bytes 281 remaining bytes 0
> 2018-07-26 11:46:24.004071 7f819e167700 10 -- 10.111.73.1:6789/0 >> -
> conn(0x55aa46bc1000 :6789 s=STATE_ACCEPTING_WAIT_BANNER_ADDR pgs=0 cs=0
> l=0)._process_connection write banner and addr done: -
> 2018-07-26 11:46:24.004199 7f819e167700 10 -- 10.111.73.1:6789/0 >> -
> conn(0x55aa46bc1000 :6789 s=STATE_ACCEPTING_WAIT_BANNER_ADDR pgs=0 cs=0
> l=0)._process_connection accept peer addr is 10.111.73.3:0/1033315403
> 2018-07-26 11:46:24.004286 7f819e167700 10 -- 10.111.73.1:6789/0 >>
> 10.111.73.3:0/1033315403 conn(0x55aa46bc1000 :6789
> s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0
> l=1)._process_connection accept of host_type 8, policy.lossy=1
> policy.server=1 policy.standby=0 policy.resetcheck=0
> 2018-07-26 11:46:24.004304 7f819e167700 10 -- 10.111.73.1:6789/0 >>
> 10.111.73.3:0/1033315403 conn(0x55aa46bc1000 :6789
> s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0
> l=1).handle_connect_msg accept my proto 15, their proto 15
> 2018-07-26 11:46:24.004319 7f819e167700 10 -- 10.111.73.1:6789/0 >>
> 10.111.73.3:0/1033315403 conn(0x55aa46bc1000 :6789
> s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0
> l=1).handle_connect_msg accept setting up session_security.
> 2018-07-26 11:46:24.004338 7f819e167700 10 -- 10.111.73.1:6789/0 >>
> 10.111.73.3:0/1033315403 conn(0x55aa46bc1000 :6789
> s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0
> l=1).handle_connect_msg accept new session
> 2018-07-26 11:46:24.004351 7f819e167700 10 -- 10.111.73.1:6789/0 >>
> 10.111.73.3:0/1033315403 conn(0x55aa46bc1000 :6789
> s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=74 cs=1
> l=1).handle_connect_msg accept success, connect_seq = 1 in_seq=0,
> sending READY
> 2018-07-26 11:46:24.004365 7f819e167700 10 -- 10.111.73.1:6789/0 >>
> 10.111.73.3:0/1033315403 conn(0x55aa46bc1000 :6789
> s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=74 cs=1
> l=1).handle_connect_msg accept features 4611087853745930235
> 2018-07-26 11:46:24.004463 7f819e167700 10 -- 10.111.73.1:6789/0 >>
> 10.111.73.3:0/1033315403 conn(0x55aa46bc1000 :6789
> s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=74 cs=1 l=1)._try_send sent
> bytes 34 remaining bytes 0
> 2018-07-26 11:46:24.004489 7f819e167700 2 -- 10.111.73.1:6789/0 >>
> 10.111.73.3:0/1033315403 conn(0x55aa46bc1000 :6789
> s=STATE_ACCEPTING_WAIT_SEQ pgs=74 cs=1 l=1).handle_connect_msg accept
> write reply msg done
> 2018-07-26 11:46:24.004634 7f819e167700 2 -- 10.111.73.1:6789/0 >>
> 10.111.73.3:0/1033315403 conn(0x55aa46bc1000 :6789
> s=STATE_ACCEPTING_WAIT_SEQ pgs=74 cs=1 l=1)._process_connection accept
> get newly_acked_seq 0
> 2018-07-26 11:46:24.004650 7f819e167700 10 -- 10.111.73.1:6789/0 >>
> 10.111.73.3:0/1033315403 conn(0x55aa46bc1000 :6789
> s=STATE_ACCEPTING_WAIT_SEQ pgs=74 cs=1 l=1).discard_requeued_up_to 0
> 2018-07-26 11:46:24.004807 7f819e167700 10 -- 10.111.73.1:6789/0 >>
> 10.111.73.3:0/1033315403 conn(0x55aa46bc1000 :6789
> s=STATE_OPEN_KEEPALIVE2 pgs=74 cs=1 l=1)._append_keepalive_or_ack
> 2018-07-26 11:46:24.004828 7f819e167700 10 -- 10.111.73.1:6789/0 >>
> 10.111.73.3:0/1033315403 conn(0x55aa46bc1000 :6789
> s=STATE_OPEN_MESSAGE_THROTTLE_BYTES pgs=74 cs=1 l=1).process wants 60
> bytes from policy throttler 180/104857600
> 2018-07-26 11:46:24.004847 7f819e167700 10 -- 10.111.73.1:6789/0 >>
> 10.111.73.3:0/1033315403 conn(0x55aa46bc1000 :6789
> s=STATE_OPEN_MESSAGE_READ_FOOTER_AND_DISPATCH pgs=74 cs=1 l=1).process
> aborted = 0
> 2018-07-26 11:46:24.004873 7f819e167700 5 -- 10.111.73.1:6789/0 >>
> 10.111.73.3:0/1033315403 conn(0x55aa46bc1000 :6789
> s=STATE_OPEN_MESSAGE_READ_FOOTER_AND_DISPATCH pgs=74 cs=1 l=1). rx
> client.? seq 1 0x55aa46be4fc0 auth(proto 0 30 bytes epoch 0) v1
> 2018-07-26 11:46:24.004914 7f819e167700 10 -- 10.111.73.1:6789/0 >>
> 10.111.73.3:0/1033315403 conn(0x55aa46bc1000 :6789 s=STATE_OPEN pgs=74
> cs=1 l=1).handle_write
> 2018-07-26 11:46:24.004921 7f81a196e700 1 -- 10.111.73.1:6789/0 <==
> client.? 10.111.73.3:0/1033315403 1 ==== auth(proto 0 30 bytes epoch 0)
> v1 ==== 60+0+0 (2547518125 0 0) 0x55aa46be4fc0 con 0x55aa46bc1000
> 2018-07-26 11:46:24.004954 7f819e167700 10 -- 10.111.73.1:6789/0 >>
> 10.111.73.3:0/1033315403 conn(0x55aa46bc1000 :6789 s=STATE_OPEN pgs=74
> cs=1 l=1)._try_send sent bytes 9 remaining bytes 0
> 2018-07-26 11:46:24.004965 7f81a196e700 10 -- 10.111.73.1:6789/0
> dispatch_throttle_release 60 to dispatch throttler 60/104857600
> 
> kind regards
> 
> ben
> 
>> On 26 July 2018 at 11:07, Wido den Hollander <wido@xxxxxxxx> wrote:
>>
>>
>>
>>
>> On 07/26/2018 10:33 AM, Benjamin Naber wrote:
>> > Hi Wido,
>> >
>> > Thanks for your reply.
>> > Time is also in sync. I forced a time sync again to be sure.
>> >
>>
>> Try setting debug_mon to 10 or even 20 and check the logs to see what the
>> MONs are saying.
>>
>> debug_ms = 10 might also help to get some more information about the
>> messenger traffic.
>>
>> Wido
>>
>> > kind regards
>> >
>> > Ben
>> >
>> >> On 26 July 2018 at 10:18, Wido den Hollander <wido@xxxxxxxx> wrote:
>> >>
>> >>
>> >>
>> >>
>> >> On 07/26/2018 10:12 AM, Benjamin Naber wrote:
>> >> > Hi all,
>> >> >
>> >> > We currently have some problems with the monitor quorum after shutting down all cluster nodes for migration to another location.
>> >> >
>> >> > mon_status gives us the following output:
>> >> >
>> >> > {
>> >> >     "name": "mon01",
>> >> >     "rank": 0,
>> >> >     "state": "electing",
>> >> >     "election_epoch": 20345,
>> >> >     "quorum": [],
>> >> >     "features": {
>> >> >         "required_con": "153140804152475648",
>> >> >         "required_mon": [
>> >> >             "kraken",
>> >> >             "luminous"
>> >> >         ],
>> >> >         "quorum_con": "0",
>> >> >         "quorum_mon": []
>> >> >     },
>> >> >     "outside_quorum": [],
>> >> >     "extra_probe_peers": [],
>> >> >     "sync_provider": [],
>> >> >     "monmap": {
>> >> >         "epoch": 1,
>> >> >         "fsid": "c1e3c489-67a4-47a2-a3ca-98816d1c9d44",
>> >> >         "modified": "2018-06-21 13:48:58.796939",
>> >> >         "created": "2018-06-21 13:48:58.796939",
>> >> >         "features": {
>> >> >             "persistent": [
>> >> >                 "kraken",
>> >> >                 "luminous"
>> >> >             ],
>> >> >             "optional": []
>> >> >         },
>> >> >         "mons": [
>> >> >             {
>> >> >                 "rank": 0,
>> >> >                 "name": "mon01",
>> >> >                 "addr": "10.111.73.1:6789/0",
>> >> >                 "public_addr": "10.111.73.1:6789/0"
>> >> >             },
>> >> >             {
>> >> >                 "rank": 1,
>> >> >                 "name": "mon02",
>> >> >                 "addr": "10.111.73.2:6789/0",
>> >> >                 "public_addr": "10.111.73.2:6789/0"
>> >> >             },
>> >> >             {
>> >> >                 "rank": 2,
>> >> >                 "name": "mon03",
>> >> >                 "addr": "10.111.73.3:6789/0",
>> >> >                 "public_addr": "10.111.73.3:6789/0"
>> >> >             }
>> >> >         ]
>> >> >     },
>> >> >     "feature_map": {
>> >> >         "mon": {
>> >> >             "group": {
>> >> >                 "features": "0x3ffddff8eea4fffb",
>> >> >                 "release": "luminous",
>> >> >                 "num": 1
>> >> >             }
>> >> >         }
>> >> >     }
>> >> > }
>> >> >
>> >> > ceph ping mon.id also just doesn't work. The monitor nodes have full network connectivity. Firewall rules are also OK.
>> >> >
>> >> > What could be the reason for the stuck quorum election?
>> >> >
>> >>
>> >> Is the time in sync between the nodes?
>> >>
>> >> Wido
>> >>
>> >> > kind regards
>> >> >
>> >> > Ben
>> >> > _______________________________________________
>> >> > ceph-users mailing list
>> >> > ceph-users@xxxxxxxxxxxxxx
>> >> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>> >> >


