Hi,

this is the debug log,

2024-03-13T11:14:28.087+0800 7f6984a95640 4 mon.memb4@3(probing) e6 probe_timeout 0x5650c2b0c3a0
2024-03-13T11:14:28.087+0800 7f6984a95640 10 mon.memb4@3(probing) e6 bootstrap
2024-03-13T11:14:28.087+0800 7f6984a95640 10 mon.memb4@3(probing) e6 sync_reset_requester
2024-03-13T11:14:28.087+0800 7f6984a95640 10 mon.memb4@3(probing) e6 unregister_cluster_logger - not registered
2024-03-13T11:14:28.087+0800 7f6984a95640 10 mon.memb4@3(probing) e6 cancel_probe_timeout (none scheduled)
2024-03-13T11:14:28.087+0800 7f6984a95640 10 mon.memb4@3(probing) e6 monmap e6: 5 mons at {memb1=[v2:10.0.4.111:3300/0,v1:10.0.4.111:6789/0],memb2=[v2:10.0.4.112:3300/0,v1:10.0.4.112:6789/0],memb3=[v2:10.0.4.113:3300/0,v1:10.0.4.113:6789/0],memb4=[v2:10.0.4.114:3300/0,v1:10.0.4.114:6789/0],memb5=[v2:10.0.4.115:3300/0,v1:10.0.4.115:6789/0]} removed_ranks: {} disallowed_leaders: {}
2024-03-13T11:14:28.087+0800 7f6984a95640 10 mon.memb4@3(probing) e6 _reset
2024-03-13T11:14:28.087+0800 7f6984a95640 10 mon.memb4@3(probing).auth v2121 _set_mon_num_rank num 0 rank 0
2024-03-13T11:14:28.087+0800 7f6984a95640 10 mon.memb4@3(probing) e6 cancel_probe_timeout (none scheduled)
2024-03-13T11:14:28.087+0800 7f6984a95640 10 mon.memb4@3(probing) e6 timecheck_finish
2024-03-13T11:14:28.087+0800 7f6984a95640 10 mon.memb4@3(probing) e6 scrub_event_cancel
2024-03-13T11:14:28.087+0800 7f6984a95640 10 mon.memb4@3(probing) e6 scrub_reset
2024-03-13T11:14:28.087+0800 7f6984a95640 10 mon.memb4@3(probing) e6 cancel_probe_timeout (none scheduled)
2024-03-13T11:14:28.087+0800 7f6984a95640 10 mon.memb4@3(probing) e6 reset_probe_timeout 0x5650bb5c8380 after 2 seconds
2024-03-13T11:14:28.087+0800 7f6984a95640 10 mon.memb4@3(probing) e6 probing other monitors
2024-03-13T11:14:28.087+0800 7f6984a95640 1 -- [v2:10.0.4.114:3300/0,v1:10.0.4.114:6789/0] send_to--> mon [v2:10.0.4.111:3300/0,v1:10.0.4.111:6789/0] -- mon_probe(probe c6ee9a01-944f-4745-be86-86e4a2a30e0d name memb4 leader -1 mon_release reef) v8 -- ?+0 0x5650d8765a00
2024-03-13T11:14:28.087+0800 7f6984a95640 10 -- [v2:10.0.4.114:3300/0,v1:10.0.4.114:6789/0] connect_to [v2:10.0.4.111:3300/0,v1:10.0.4.111:6789/0] existing 0x565071e1dc00
2024-03-13T11:14:28.087+0800 7f6984a95640 1 -- [v2:10.0.4.114:3300/0,v1:10.0.4.114:6789/0] --> [v2:10.0.4.111:3300/0,v1:10.0.4.111:6789/0] -- mon_probe(probe c6ee9a01-944f-4745-be86-86e4a2a30e0d name memb4 leader -1 mon_release reef) v8 -- 0x5650d8765a00 con 0x565071e1dc00
2024-03-13T11:14:28.087+0800 7f6984a95640 5 --2- [v2:10.0.4.114:3300/0,v1:10.0.4.114:6789/0] >> [v2:10.0.4.111:3300/0,v1:10.0.4.111:6789/0] conn(0x565071e1dc00 0x565070fbac00 unknown :-1 s=BANNER_CONNECTING pgs=20 cs=955 l=0 rev1=1 crypto rx=0 tx=0 comp rx=0 tx=0).send_message enqueueing message m=0x5650d8765a00 type=67 mon_probe(probe c6ee9a01-944f-4745-be86-86e4a2a30e0d name memb4 leader -1 mon_release reef) v8
2024-03-13T11:14:28.087+0800 7f6984a95640 1 -- [v2:10.0.4.114:3300/0,v1:10.0.4.114:6789/0] send_to--> mon [v2:10.0.4.112:3300/0,v1:10.0.4.112:6789/0] -- mon_probe(probe c6ee9a01-944f-4745-be86-86e4a2a30e0d name memb4 leader -1 mon_release reef) v8 -- ?+0 0x5650d8765c00
2024-03-13T11:14:28.087+0800 7f6984a95640 10 -- [v2:10.0.4.114:3300/0,v1:10.0.4.114:6789/0] connect_to [v2:10.0.4.112:3300/0,v1:10.0.4.112:6789/0] existing 0x5650721d6c00
2024-03-13T11:14:28.087+0800 7f6984a95640 1 -- [v2:10.0.4.114:3300/0,v1:10.0.4.114:6789/0] --> [v2:10.0.4.112:3300/0,v1:10.0.4.112:6789/0] -- mon_probe(probe c6ee9a01-944f-4745-be86-86e4a2a30e0d name memb4 leader -1 mon_release reef) v8 -- 0x5650d8765c00 con 0x5650721d6c00
2024-03-13T11:14:28.087+0800 7f6984a95640 5 --2- [v2:10.0.4.114:3300/0,v1:10.0.4.114:6789/0] >> [v2:10.0.4.112:3300/0,v1:10.0.4.112:6789/0] conn(0x5650721d6c00 0x565070fba680 unknown :-1 s=BANNER_CONNECTING pgs=92 cs=960 l=0 rev1=1 crypto rx=0 tx=0 comp rx=0 tx=0).send_message enqueueing message m=0x5650d8765c00 type=67 mon_probe(probe c6ee9a01-944f-4745-be86-86e4a2a30e0d name memb4 leader -1 mon_release reef) v8
2024-03-13T11:14:28.087+0800 7f6984a95640 1 -- [v2:10.0.4.114:3300/0,v1:10.0.4.114:6789/0] send_to--> mon [v2:10.0.4.113:3300/0,v1:10.0.4.113:6789/0] -- mon_probe(probe c6ee9a01-944f-4745-be86-86e4a2a30e0d name memb4 leader -1 mon_release reef) v8 -- ?+0 0x5650d8765e00
2024-03-13T11:14:28.087+0800 7f6984a95640 10 -- [v2:10.0.4.114:3300/0,v1:10.0.4.114:6789/0] connect_to [v2:10.0.4.113:3300/0,v1:10.0.4.113:6789/0] existing 0x5650721d7000
2024-03-13T11:14:28.087+0800 7f6984a95640 1 -- [v2:10.0.4.114:3300/0,v1:10.0.4.114:6789/0] --> [v2:10.0.4.113:3300/0,v1:10.0.4.113:6789/0] -- mon_probe(probe c6ee9a01-944f-4745-be86-86e4a2a30e0d name memb4 leader -1 mon_release reef) v8 -- 0x5650d8765e00 con 0x5650721d7000
2024-03-13T11:14:28.087+0800 7f6984a95640 5 --2- [v2:10.0.4.114:3300/0,v1:10.0.4.114:6789/0] >> [v2:10.0.4.113:3300/0,v1:10.0.4.113:6789/0] conn(0x5650721d7000 0x565070fba100 unknown :-1 s=BANNER_CONNECTING pgs=20 cs=962 l=0 rev1=1 crypto rx=0 tx=0 comp rx=0 tx=0).send_message enqueueing message m=0x5650d8765e00 type=67 mon_probe(probe c6ee9a01-944f-4745-be86-86e4a2a30e0d name memb4 leader -1 mon_release reef) v8
2024-03-13T11:14:28.087+0800 7f6984a95640 1 -- [v2:10.0.4.114:3300/0,v1:10.0.4.114:6789/0] send_to--> mon [v2:10.0.4.115:3300/0,v1:10.0.4.115:6789/0] -- mon_probe(probe c6ee9a01-944f-4745-be86-86e4a2a30e0d name memb4 leader -1 mon_release reef) v8 -- ?+0 0x5650d8768000
2024-03-13T11:14:28.087+0800 7f6984a95640 10 -- [v2:10.0.4.114:3300/0,v1:10.0.4.114:6789/0] connect_to [v2:10.0.4.115:3300/0,v1:10.0.4.115:6789/0] existing 0x5650721d7400
2024-03-13T11:14:28.087+0800 7f6984a95640 1 -- [v2:10.0.4.114:3300/0,v1:10.0.4.114:6789/0] --> [v2:10.0.4.115:3300/0,v1:10.0.4.115:6789/0] -- mon_probe(probe c6ee9a01-944f-4745-be86-86e4a2a30e0d name memb4 leader -1 mon_release reef) v8 -- 0x5650d8768000 con 0x5650721d7400
2024-03-13T11:14:28.087+0800 7f6984a95640 5 --2- [v2:10.0.4.114:3300/0,v1:10.0.4.114:6789/0] >> [v2:10.0.4.115:3300/0,v1:10.0.4.115:6789/0] conn(0x5650721d7400 0x565070fbb180 unknown :-1 s=START_CONNECT pgs=20 cs=948 l=0 rev1=1 crypto rx=0 tx=0 comp rx=0 tx=0).send_message enqueueing message m=0x5650d8768000 type=67 mon_probe(probe c6ee9a01-944f-4745-be86-86e4a2a30e0d name memb4 leader -1 mon_release reef) v8
2024-03-13T11:14:29.795+0800 7f6988a9d640 1 -- [v2:10.0.4.114:3300/0,v1:10.0.4.114:6789/0] >> [v2:10.0.4.113:3300/0,v1:10.0.4.113:6789/0] conn(0x5650721d7000 msgr2=0x565070fba100 unknown :-1 s=STATE_CONNECTION_ESTABLISHED l=0).tick see no progress in more than 10000000 us during connecting to v2:10.0.4.113:3300/0, fault.
2024-03-13T11:14:29.795+0800 7f6988a9d640 10 --2- [v2:10.0.4.114:3300/0,v1:10.0.4.114:6789/0] >> [v2:10.0.4.113:3300/0,v1:10.0.4.113:6789/0] conn(0x5650721d7000 0x565070fba100 unknown :-1 s=BANNER_CONNECTING pgs=20 cs=962 l=0 rev1=1 crypto rx=0 tx=0 comp rx=0 tx=0)._fault
2024-03-13T11:14:29.795+0800 7f6980206640 1 RDMAStack handle_tx_event sending of the disconnect msg completed
2024-03-13T11:14:29.795+0800 7f6988a9d640 5 --2- [v2:10.0.4.114:3300/0,v1:10.0.4.114:6789/0] >> [v2:10.0.4.113:3300/0,v1:10.0.4.113:6789/0] conn(0x5650721d7000 0x565070fba100 unknown :-1 s=BANNER_CONNECTING pgs=20 cs=962 l=0 rev1=1 crypto rx=0 tx=0 comp rx=0 tx=0).reset_recv_state
2024-03-13T11:14:29.795+0800 7f6988a9d640 5 --2- [v2:10.0.4.114:3300/0,v1:10.0.4.114:6789/0] >> [v2:10.0.4.113:3300/0,v1:10.0.4.113:6789/0] conn(0x5650721d7000 0x565070fba100 unknown :-1 s=BANNER_CONNECTING pgs=20 cs=962 l=0 rev1=1 crypto rx=0 tx=0 comp rx=0 tx=0).reset_security
2024-03-13T11:14:29.795+0800 7f6988a9d640 5 --2- [v2:10.0.4.114:3300/0,v1:10.0.4.114:6789/0] >> [v2:10.0.4.113:3300/0,v1:10.0.4.113:6789/0] conn(0x5650721d7000 0x565070fba100 unknown :-1 s=BANNER_CONNECTING pgs=20 cs=962 l=0 rev1=1 crypto rx=0 tx=0 comp rx=0 tx=0).reset_compression
2024-03-13T11:14:29.795+0800 7f6988a9d640 1 --2- [v2:10.0.4.114:3300/0,v1:10.0.4.114:6789/0] >> [v2:10.0.4.113:3300/0,v1:10.0.4.113:6789/0] conn(0x5650721d7000 0x565070fba100 unknown :-1 s=START_CONNECT pgs=20 cs=963 l=0 rev1=1 crypto rx=0 tx=0 comp rx=0 tx=0)._fault waiting 15.000000
2024-03-13T11:14:29.795+0800 7f6980206640 10 RDMAStack polling finally delete qp = 0x5650c54164b0

Eugen Block <eblock@xxxxxx> wrote on Tuesday, March 19, 2024 at 14:50:

> Hi,
>
> There are several existing threads on this list. Have you tried to
> apply those suggestions? A couple of them were:
>
> - ceph mgr fail
> - check time sync (NTP, chrony)
> - different weights for MONs
> - check debug logs
>
> Regards,
> Eugen
>
> Quoting faicker mo <faicker.mo@xxxxxxxxx>:
>
> > some logs here,
> > 2024-03-13T11:13:34.083+0800 7f6984a95640 4 mon.memb4@3(probing) e6 probe_timeout 0x5650c19d6100
> > 2024-03-13T11:13:34.083+0800 7f6984a95640 10 mon.memb4@3(probing) e6 bootstrap
> > 2024-03-13T11:13:34.083+0800 7f6984a95640 10 mon.memb4@3(probing) e6 sync_reset_requester
> > 2024-03-13T11:13:34.083+0800 7f6984a95640 10 mon.memb4@3(probing) e6 unregister_cluster_logger - not registered
> > 2024-03-13T11:13:34.083+0800 7f6984a95640 10 mon.memb4@3(probing) e6 cancel_probe_timeout (none scheduled)
> > 2024-03-13T11:13:34.083+0800 7f6984a95640 10 mon.memb4@3(probing) e6 monmap e6: 5 mons at {memb1=[v2:10.0.4.111:3300/0,v1:10.0.4.111:6789/0],memb2=[v2:10.0.4.112:3300/0,v1:10.0.4.112:6789/0],memb3=[v2:10.0.4.113:3300/0,v1:10.0.4.113:6789/0],memb4=[v2:10.0.4.114:3300/0,v1:10.0.4.114:6789/0],memb5=[v2:10.0.4.115:3300/0,v1:10.0.4.115:6789/0]} removed_ranks: {} disallowed_leaders: {}
> > 2024-03-13T11:13:34.083+0800 7f6984a95640 10 mon.memb4@3(probing) e6 _reset
> > 2024-03-13T11:13:34.083+0800 7f6984a95640 10 mon.memb4@3(probing).auth v2121 _set_mon_num_rank num 0 rank 0
> > 2024-03-13T11:13:34.083+0800 7f6984a95640 10 mon.memb4@3(probing) e6 cancel_probe_timeout (none scheduled)
> > 2024-03-13T11:13:34.083+0800 7f6984a95640 10 mon.memb4@3(probing) e6 timecheck_finish
> > 2024-03-13T11:13:34.083+0800 7f6984a95640 10 mon.memb4@3(probing) e6 scrub_event_cancel
> > 2024-03-13T11:13:34.083+0800 7f6984a95640 10 mon.memb4@3(probing) e6 scrub_reset
> > 2024-03-13T11:13:34.083+0800 7f6984a95640 10 mon.memb4@3(probing) e6 cancel_probe_timeout (none scheduled)
> > 2024-03-13T11:13:34.083+0800 7f6984a95640 10 mon.memb4@3(probing) e6 reset_probe_timeout 0x5650bb5c9780 after 2 seconds
> > 2024-03-13T11:13:34.083+0800 7f6984a95640 10 mon.memb4@3(probing) e6 probing other monitors
> > 2024-03-13T11:13:34.399+0800 7f697fa05640 10 mon.memb4@3(probing) e6 ms_handle_reset 0x5650bd339800 -
> > 2024-03-13T11:13:34.403+0800 7f697fa05640 10 mon.memb4@3(probing) e6 ms_handle_reset 0x5650c45e2800 -
> >
> > faicker mo <faicker.mo@xxxxxxxxx> wrote on Wednesday, March 13, 2024 at 16:02:
> >
> >> Hello,
> >> The problem is a mon stuck in probing state.
> >> The env is ceph 18.2.1 on ubuntu22.04 with rdma, 5 mons. One mon memb4
> >> is out of quorum.
> >> The debug log is attached.
> >> Thanks.
> >>

_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx
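
For anyone triaging a log like the one in this thread, here is a minimal sketch that groups the messenger connection states (`s=...`) by peer monitor address, so it is easier to see which peers never leave `BANNER_CONNECTING` or `START_CONNECT`. The regular expression is an assumption derived only from the lines quoted above, not from any official Ceph log format specification, and the names `CONN_RE` and `states_by_peer` are hypothetical helpers invented for this example.

```python
import re
from collections import defaultdict

# Assumed line shape, taken from the log lines in this thread:
#   ... >> [v2:<peer-ip>:<port>/<nonce>,v1:...] conn(<ptrs> ... s=<STATE> ...
# "\s*" after "v2:" tolerates the stray space some mail archives insert.
CONN_RE = re.compile(
    r">> \[v2:\s*(?P<peer>[\d.]+):\d+/\d+,[^\]]*\] "
    r"conn\([^)]*?\bs=(?P<state>\w+)"
)


def states_by_peer(log_lines):
    """Return {peer_ip: [state, ...]} with consecutive duplicates collapsed."""
    seen = defaultdict(list)
    for line in log_lines:
        match = CONN_RE.search(line)
        if match is None:
            continue  # not a messenger connection line
        peer, state = match["peer"], match["state"]
        if not seen[peer] or seen[peer][-1] != state:
            seen[peer].append(state)
    return dict(seen)


if __name__ == "__main__":
    import sys

    # Usage (hypothetical file name): python conn_states.py < ceph-mon.memb4.log
    for peer, states in sorted(states_by_peer(sys.stdin).items()):
        print(peer, " -> ".join(states))
```

The script only summarizes what the log already says; it does not diagnose the cause. A peer whose state sequence keeps cycling between connecting states would simply be a hint to look at the transport (here RDMA) or the network path to that monitor.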