Re: mon stuck in probing

This is not much to go on, to be honest. Have you tried any of the suggested debugging steps and checked the existing threads?
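For example, something along these lines is usually a reasonable starting point (a rough sketch only; the mon name "memb4" and the peer address 10.0.4.113 are taken from your log, and the admin-socket commands have to run on the memb4 host):

  ceph mgr fail                                  # restart the active mgr
  ceph mon stat                                  # which mons are currently in quorum
  ceph daemon mon.memb4 mon_status               # state of the stuck mon via its admin socket
  chronyc tracking                               # verify time sync (or timedatectl / ntpq -p)
  ceph daemon mon.memb4 config set debug_mon 20  # temporarily raise mon debug logging
  nc -zv 10.0.4.113 3300                         # basic reachability of a peer's msgr2 port

The last check is worth running in both directions, since in the log below the connection attempts to the other mons appear to sit in BANNER_CONNECTING/START_CONNECT and eventually fault with "no progress ... during connecting".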

Quoting faicker mo <faicker.mo@xxxxxxxxx>:

Hi, here is the debug log:

2024-03-13T11:14:28.087+0800 7f6984a95640  4 mon.memb4@3(probing) e6
probe_timeout 0x5650c2b0c3a0
2024-03-13T11:14:28.087+0800 7f6984a95640 10 mon.memb4@3(probing) e6
bootstrap
2024-03-13T11:14:28.087+0800 7f6984a95640 10 mon.memb4@3(probing) e6
sync_reset_requester
2024-03-13T11:14:28.087+0800 7f6984a95640 10 mon.memb4@3(probing) e6
unregister_cluster_logger - not registered
2024-03-13T11:14:28.087+0800 7f6984a95640 10 mon.memb4@3(probing) e6
cancel_probe_timeout (none scheduled)
2024-03-13T11:14:28.087+0800 7f6984a95640 10 mon.memb4@3(probing) e6 monmap
e6: 5 mons at {memb1=[v2:10.0.4.111:3300/0,v1:10.0.4.111:6789/0],memb2=[v2:
10.0.4.112:3300/0,v1:10.0.4.112:6789/0],memb3=[v2:
10.0.4.113:3300/0,v1:10.0.4.113:6789/0],memb4=[v2:
10.0.4.114:3300/0,v1:10.0.4.114:6789/0],memb5=[v2:
10.0.4.115:3300/0,v1:10.0.4.115:6789/0]} removed_ranks: {}
disallowed_leaders: {}
2024-03-13T11:14:28.087+0800 7f6984a95640 10 mon.memb4@3(probing) e6 _reset
2024-03-13T11:14:28.087+0800 7f6984a95640 10 mon.memb4@3(probing).auth
v2121 _set_mon_num_rank num 0 rank 0
2024-03-13T11:14:28.087+0800 7f6984a95640 10 mon.memb4@3(probing) e6
cancel_probe_timeout (none scheduled)
2024-03-13T11:14:28.087+0800 7f6984a95640 10 mon.memb4@3(probing) e6
timecheck_finish
2024-03-13T11:14:28.087+0800 7f6984a95640 10 mon.memb4@3(probing) e6
scrub_event_cancel
2024-03-13T11:14:28.087+0800 7f6984a95640 10 mon.memb4@3(probing) e6
scrub_reset
2024-03-13T11:14:28.087+0800 7f6984a95640 10 mon.memb4@3(probing) e6
cancel_probe_timeout (none scheduled)
2024-03-13T11:14:28.087+0800 7f6984a95640 10 mon.memb4@3(probing) e6
reset_probe_timeout 0x5650bb5c8380 after 2 seconds
2024-03-13T11:14:28.087+0800 7f6984a95640 10 mon.memb4@3(probing) e6
probing other monitors
2024-03-13T11:14:28.087+0800 7f6984a95640  1 -- [v2:
10.0.4.114:3300/0,v1:10.0.4.114:6789/0] send_to--> mon [v2:
10.0.4.111:3300/0,v1:10.0.4.111:6789/0] -- mon_probe(probe
c6ee9a01-944f-4745-be86-86e4a2a30e0d name memb4 leader -1 mon_release reef)
v8 -- ?+0 0x5650d8765a00
2024-03-13T11:14:28.087+0800 7f6984a95640 10 -- [v2:
10.0.4.114:3300/0,v1:10.0.4.114:6789/0] connect_to [v2:
10.0.4.111:3300/0,v1:10.0.4.111:6789/0] existing 0x565071e1dc00
2024-03-13T11:14:28.087+0800 7f6984a95640  1 -- [v2:
10.0.4.114:3300/0,v1:10.0.4.114:6789/0] --> [v2:
10.0.4.111:3300/0,v1:10.0.4.111:6789/0] -- mon_probe(probe
c6ee9a01-944f-4745-be86-86e4a2a30e0d name memb4 leader -1 mon_release reef)
v8 -- 0x5650d8765a00 con 0x565071e1dc00
2024-03-13T11:14:28.087+0800 7f6984a95640  5 --2- [v2:
10.0.4.114:3300/0,v1:10.0.4.114:6789/0] >> [v2:
10.0.4.111:3300/0,v1:10.0.4.111:6789/0] conn(0x565071e1dc00 0x565070fbac00
unknown :-1 s=BANNER_CONNECTING pgs=20 cs=955 l=0 rev1=1 crypto rx=0 tx=0
comp rx=0 tx=0).send_message enqueueing message m=0x5650d8765a00 type=67
mon_probe(probe c6ee9a01-944f-4745-be86-86e4a2a30e0d name memb4 leader -1
mon_release reef) v8
2024-03-13T11:14:28.087+0800 7f6984a95640  1 -- [v2:
10.0.4.114:3300/0,v1:10.0.4.114:6789/0] send_to--> mon [v2:
10.0.4.112:3300/0,v1:10.0.4.112:6789/0] -- mon_probe(probe
c6ee9a01-944f-4745-be86-86e4a2a30e0d name memb4 leader -1 mon_release reef)
v8 -- ?+0 0x5650d8765c00
2024-03-13T11:14:28.087+0800 7f6984a95640 10 -- [v2:
10.0.4.114:3300/0,v1:10.0.4.114:6789/0] connect_to [v2:
10.0.4.112:3300/0,v1:10.0.4.112:6789/0] existing 0x5650721d6c00
2024-03-13T11:14:28.087+0800 7f6984a95640  1 -- [v2:
10.0.4.114:3300/0,v1:10.0.4.114:6789/0] --> [v2:
10.0.4.112:3300/0,v1:10.0.4.112:6789/0] -- mon_probe(probe
c6ee9a01-944f-4745-be86-86e4a2a30e0d name memb4 leader -1 mon_release reef)
v8 -- 0x5650d8765c00 con 0x5650721d6c00
2024-03-13T11:14:28.087+0800 7f6984a95640  5 --2- [v2:
10.0.4.114:3300/0,v1:10.0.4.114:6789/0] >> [v2:
10.0.4.112:3300/0,v1:10.0.4.112:6789/0] conn(0x5650721d6c00 0x565070fba680
unknown :-1 s=BANNER_CONNECTING pgs=92 cs=960 l=0 rev1=1 crypto rx=0 tx=0
comp rx=0 tx=0).send_message enqueueing message m=0x5650d8765c00 type=67
mon_probe(probe c6ee9a01-944f-4745-be86-86e4a2a30e0d name memb4 leader -1
mon_release reef) v8
2024-03-13T11:14:28.087+0800 7f6984a95640  1 -- [v2:
10.0.4.114:3300/0,v1:10.0.4.114:6789/0] send_to--> mon [v2:
10.0.4.113:3300/0,v1:10.0.4.113:6789/0] -- mon_probe(probe
c6ee9a01-944f-4745-be86-86e4a2a30e0d name memb4 leader -1 mon_release reef)
v8 -- ?+0 0x5650d8765e00
2024-03-13T11:14:28.087+0800 7f6984a95640 10 -- [v2:
10.0.4.114:3300/0,v1:10.0.4.114:6789/0] connect_to [v2:
10.0.4.113:3300/0,v1:10.0.4.113:6789/0] existing 0x5650721d7000
2024-03-13T11:14:28.087+0800 7f6984a95640  1 -- [v2:
10.0.4.114:3300/0,v1:10.0.4.114:6789/0] --> [v2:
10.0.4.113:3300/0,v1:10.0.4.113:6789/0] -- mon_probe(probe
c6ee9a01-944f-4745-be86-86e4a2a30e0d name memb4 leader -1 mon_release reef)
v8 -- 0x5650d8765e00 con 0x5650721d7000
2024-03-13T11:14:28.087+0800 7f6984a95640  5 --2- [v2:
10.0.4.114:3300/0,v1:10.0.4.114:6789/0] >> [v2:
10.0.4.113:3300/0,v1:10.0.4.113:6789/0] conn(0x5650721d7000 0x565070fba100
unknown :-1 s=BANNER_CONNECTING pgs=20 cs=962 l=0 rev1=1 crypto rx=0 tx=0
comp rx=0 tx=0).send_message enqueueing message m=0x5650d8765e00 type=67
mon_probe(probe c6ee9a01-944f-4745-be86-86e4a2a30e0d name memb4 leader -1
mon_release reef) v8
2024-03-13T11:14:28.087+0800 7f6984a95640  1 -- [v2:
10.0.4.114:3300/0,v1:10.0.4.114:6789/0] send_to--> mon [v2:
10.0.4.115:3300/0,v1:10.0.4.115:6789/0] -- mon_probe(probe
c6ee9a01-944f-4745-be86-86e4a2a30e0d name memb4 leader -1 mon_release reef)
v8 -- ?+0 0x5650d8768000
2024-03-13T11:14:28.087+0800 7f6984a95640 10 -- [v2:
10.0.4.114:3300/0,v1:10.0.4.114:6789/0] connect_to [v2:
10.0.4.115:3300/0,v1:10.0.4.115:6789/0] existing 0x5650721d7400
2024-03-13T11:14:28.087+0800 7f6984a95640  1 -- [v2:
10.0.4.114:3300/0,v1:10.0.4.114:6789/0] --> [v2:
10.0.4.115:3300/0,v1:10.0.4.115:6789/0] -- mon_probe(probe
c6ee9a01-944f-4745-be86-86e4a2a30e0d name memb4 leader -1 mon_release reef)
v8 -- 0x5650d8768000 con 0x5650721d7400
2024-03-13T11:14:28.087+0800 7f6984a95640  5 --2- [v2:
10.0.4.114:3300/0,v1:10.0.4.114:6789/0] >> [v2:
10.0.4.115:3300/0,v1:10.0.4.115:6789/0] conn(0x5650721d7400 0x565070fbb180
unknown :-1 s=START_CONNECT pgs=20 cs=948 l=0 rev1=1 crypto rx=0 tx=0 comp
rx=0 tx=0).send_message enqueueing message m=0x5650d8768000 type=67
mon_probe(probe c6ee9a01-944f-4745-be86-86e4a2a30e0d name memb4 leader -1
mon_release reef) v8
2024-03-13T11:14:29.795+0800 7f6988a9d640  1 -- [v2:
10.0.4.114:3300/0,v1:10.0.4.114:6789/0] >> [v2:
10.0.4.113:3300/0,v1:10.0.4.113:6789/0] conn(0x5650721d7000
msgr2=0x565070fba100 unknown :-1 s=STATE_CONNECTION_ESTABLISHED l=0).tick
see no progress in more than 10000000 us during connecting to v2:
10.0.4.113:3300/0, fault.
2024-03-13T11:14:29.795+0800 7f6988a9d640 10 --2- [v2:
10.0.4.114:3300/0,v1:10.0.4.114:6789/0] >> [v2:
10.0.4.113:3300/0,v1:10.0.4.113:6789/0] conn(0x5650721d7000 0x565070fba100
unknown :-1 s=BANNER_CONNECTING pgs=20 cs=962 l=0 rev1=1 crypto rx=0 tx=0
comp rx=0 tx=0)._fault
2024-03-13T11:14:29.795+0800 7f6980206640  1 RDMAStack handle_tx_event
sending of the disconnect msg completed
2024-03-13T11:14:29.795+0800 7f6988a9d640  5 --2- [v2:
10.0.4.114:3300/0,v1:10.0.4.114:6789/0] >> [v2:
10.0.4.113:3300/0,v1:10.0.4.113:6789/0] conn(0x5650721d7000 0x565070fba100
unknown :-1 s=BANNER_CONNECTING pgs=20 cs=962 l=0 rev1=1 crypto rx=0 tx=0
comp rx=0 tx=0).reset_recv_state
2024-03-13T11:14:29.795+0800 7f6988a9d640  5 --2- [v2:
10.0.4.114:3300/0,v1:10.0.4.114:6789/0] >> [v2:
10.0.4.113:3300/0,v1:10.0.4.113:6789/0] conn(0x5650721d7000 0x565070fba100
unknown :-1 s=BANNER_CONNECTING pgs=20 cs=962 l=0 rev1=1 crypto rx=0 tx=0
comp rx=0 tx=0).reset_security
2024-03-13T11:14:29.795+0800 7f6988a9d640  5 --2- [v2:
10.0.4.114:3300/0,v1:10.0.4.114:6789/0] >> [v2:
10.0.4.113:3300/0,v1:10.0.4.113:6789/0] conn(0x5650721d7000 0x565070fba100
unknown :-1 s=BANNER_CONNECTING pgs=20 cs=962 l=0 rev1=1 crypto rx=0 tx=0
comp rx=0 tx=0).reset_compression
2024-03-13T11:14:29.795+0800 7f6988a9d640  1 --2- [v2:
10.0.4.114:3300/0,v1:10.0.4.114:6789/0] >> [v2:
10.0.4.113:3300/0,v1:10.0.4.113:6789/0] conn(0x5650721d7000 0x565070fba100
unknown :-1 s=START_CONNECT pgs=20 cs=963 l=0 rev1=1 crypto rx=0 tx=0
comp rx=0 tx=0)._fault waiting 15.000000
2024-03-13T11:14:29.795+0800 7f6980206640 10 RDMAStack polling finally
delete qp = 0x5650c54164b0

Eugen Block <eblock@xxxxxx> wrote on Tue, Mar 19, 2024 at 14:50:

Hi,

There are several existing threads on this list; have you tried applying those suggestions? A couple of them were:

- ceph mgr fail
- check time sync (NTP, chrony)
- check for different MON weights
- check the debug logs

Regards,
Eugen

Quoting faicker mo <faicker.mo@xxxxxxxxx>:

> some logs here,
> 2024-03-13T11:13:34.083+0800 7f6984a95640  4 mon.memb4@3(probing) e6
> probe_timeout 0x5650c19d6100
> 2024-03-13T11:13:34.083+0800 7f6984a95640 10 mon.memb4@3(probing) e6
> bootstrap
> 2024-03-13T11:13:34.083+0800 7f6984a95640 10 mon.memb4@3(probing) e6
> sync_reset_requester
> 2024-03-13T11:13:34.083+0800 7f6984a95640 10 mon.memb4@3(probing) e6
> unregister_cluster_logger - not registered
> 2024-03-13T11:13:34.083+0800 7f6984a95640 10 mon.memb4@3(probing) e6
> cancel_probe_timeout (none scheduled)
> 2024-03-13T11:13:34.083+0800 7f6984a95640 10 mon.memb4@3(probing) e6 monmap
> e6: 5 mons at {memb1=[v2:10.0.4.111:3300/0,v1:10.0.4.111:6789/0],memb2=[v2:
> 10.0.4.112:3300/0,v1:10.0.4.112:6789/0],memb3=[v2:
> 10.0.4.113:3300/0,v1:10.0.4.113:6789/0],memb4=[v2:
> 10.0.4.114:3300/0,v1:10.0.4.114:6789/0],memb5=[v2:
> 10.0.4.115:3300/0,v1:10.0.4.115:6789/0]} removed_ranks: {}
> disallowed_leaders: {}
> 2024-03-13T11:13:34.083+0800 7f6984a95640 10 mon.memb4@3(probing) e6 _reset
> 2024-03-13T11:13:34.083+0800 7f6984a95640 10 mon.memb4@3(probing).auth
> v2121 _set_mon_num_rank num 0 rank 0
> 2024-03-13T11:13:34.083+0800 7f6984a95640 10 mon.memb4@3(probing) e6
> cancel_probe_timeout (none scheduled)
> 2024-03-13T11:13:34.083+0800 7f6984a95640 10 mon.memb4@3(probing) e6
> timecheck_finish
> 2024-03-13T11:13:34.083+0800 7f6984a95640 10 mon.memb4@3(probing) e6
> scrub_event_cancel
> 2024-03-13T11:13:34.083+0800 7f6984a95640 10 mon.memb4@3(probing) e6
> scrub_reset
> 2024-03-13T11:13:34.083+0800 7f6984a95640 10 mon.memb4@3(probing) e6
> cancel_probe_timeout (none scheduled)
> 2024-03-13T11:13:34.083+0800 7f6984a95640 10 mon.memb4@3(probing) e6
> reset_probe_timeout 0x5650bb5c9780 after 2 seconds
> 2024-03-13T11:13:34.083+0800 7f6984a95640 10 mon.memb4@3(probing) e6
> probing other monitors
> 2024-03-13T11:13:34.399+0800 7f697fa05640 10 mon.memb4@3(probing) e6
> ms_handle_reset 0x5650bd339800 -
> 2024-03-13T11:13:34.403+0800 7f697fa05640 10 mon.memb4@3(probing) e6
> ms_handle_reset 0x5650c45e2800 -
>
faicker mo <faicker.mo@xxxxxxxxx> wrote on Wed, Mar 13, 2024 at 16:02:
>
>> Hello,
>>   The problem is a mon stuck in the probing state.
>>   The environment is Ceph 18.2.1 on Ubuntu 22.04 with RDMA, with 5 mons. One
>> mon, memb4, is out of quorum.
>>   The debug log is attached.
>>   Thanks.
>>



_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx



