Re: 1/3 mons down! mon do not rejoin

On Sun, 25 July 2021 at 18:02, Dan van der Ster
<dan@xxxxxxxxxxxxxx> wrote:
>
> What do you have for the new global_id settings? Maybe set it to allow insecure global_id auth and see if that allows the mon to join?

auth_allow_insecure_global_id_reclaim is still allowed, as we have
some VMs that have not been restarted yet:

# ceph config get mon.*
WHO MASK LEVEL    OPTION                                         VALUE RO
mon      advanced auth_allow_insecure_global_id_reclaim          true
mon      advanced mon_warn_on_insecure_global_id_reclaim         false
mon      advanced mon_warn_on_insecure_global_id_reclaim_allowed false
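
For reference, this is how I would check and later disable that option once
all clients are restarted (a quick sketch with the standard ceph config
commands; I have not flipped it yet):

# ceph config dump | grep global_id
# ceph config set mon auth_allow_insecure_global_id_reclaim false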

> > I can try to move the /var/lib/ceph/mon/ dir and recreate it!?
>
> I'm not sure it will help. Running the mon with --debug_ms=1 might give clues why it's stuck probing.
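
To get the messenger debugging, I restarted the mon in the foreground
roughly like this (a sketch of the invocation; same flags as my earlier
attempt, plus --debug_ms):

ceph-mon -f --cluster ceph --id osd01 --setuser ceph --setgroup ceph \
    --debug_mon 10 --debug_ms 1 -d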

2021-07-25 16:28:41.418 7fcc613d8700 10 mon.osd01@0(probing) e1
probing other monitors
2021-07-25 16:28:41.418 7fcc613d8700  1 --
[v2:10.152.28.171:3300/0,v1:10.152.28.171:6789/0] send_to--> mon
[v2:10.152.28.172:3300/0,v1:10.152.28.172:6789/0] -- mon_probe(probe
a6baa789-6be2-4ce0-ab2d-7c78b899d4bd name osd01 mon_release 14) v7 --
?+0 0x55c6b35ae780
2021-07-25 16:28:41.418 7fcc613d8700  1 --
[v2:10.152.28.171:3300/0,v1:10.152.28.171:6789/0] -->
[v2:10.152.28.172:3300/0,v1:10.152.28.172:6789/0] -- mon_probe(probe
a6baa789-6be2-4ce0-ab2d-7c78b899d4bd name osd01 mon_release 14) v7 --
0x55c6b35ae780 con 0x55c6b2611180
2021-07-25 16:28:41.418 7fcc613d8700  1 --
[v2:10.152.28.171:3300/0,v1:10.152.28.171:6789/0] send_to--> mon
[v2:10.152.28.173:3300/0,v1:10.152.28.173:6789/0] -- mon_probe(probe
a6baa789-6be2-4ce0-ab2d-7c78b899d4bd name osd01 mon_release 14) v7 --
?+0 0x55c6b35aea00
2021-07-25 16:28:41.418 7fcc613d8700  1 --
[v2:10.152.28.171:3300/0,v1:10.152.28.171:6789/0] -->
[v2:10.152.28.173:3300/0,v1:10.152.28.173:6789/0] -- mon_probe(probe
a6baa789-6be2-4ce0-ab2d-7c78b899d4bd name osd01 mon_release 14) v7 --
0x55c6b35aea00 con 0x55c6b2611600
2021-07-25 16:28:41.814 7fcc5dbd1700  1 --2-
[v2:10.152.28.171:3300/0,v1:10.152.28.171:6789/0] >>
[v2:10.152.28.173:3300/0,v1:10.152.28.173:6789/0] conn(0x55c6b2611600
0x55c6b3323c00 unknown :-1 s=BANNER_CONNECTING pgs=0 cs=0 l=0 rev1=1
rx=0 tx=0)._handle_peer_banner_payload supported=1 required=0
2021-07-25 16:28:41.814 7fcc62bdb700  1 --2-
[v2:10.152.28.171:3300/0,v1:10.152.28.171:6789/0] >>
[v2:10.152.28.172:3300/0,v1:10.152.28.172:6789/0] conn(0x55c6b2611180
0x55c6b3323500 unknown :-1 s=BANNER_CONNECTING pgs=0 cs=0 l=0 rev1=1
rx=0 tx=0)._handle_peer_banner_payload supported=1 required=0
2021-07-25 16:28:41.814 7fcc62bdb700 10 mon.osd01@0(probing) e1
ms_get_authorizer for mon
2021-07-25 16:28:41.814 7fcc5dbd1700 10 mon.osd01@0(probing) e1
ms_get_authorizer for mon
2021-07-25 16:28:41.814 7fcc62bdb700  1 --
[v2:10.152.28.171:3300/0,v1:10.152.28.171:6789/0] >>
[v2:10.152.28.172:3300/0,v1:10.152.28.172:6789/0] conn(0x55c6b2611180
msgr2=0x55c6b3323500 secure :-1 s=STATE_CONNECTION_ESTABLISHED
l=0).read_bulk peer close file descriptor 27
2021-07-25 16:28:41.814 7fcc62bdb700  1 --
[v2:10.152.28.171:3300/0,v1:10.152.28.171:6789/0] >>
[v2:10.152.28.172:3300/0,v1:10.152.28.172:6789/0] conn(0x55c6b2611180
msgr2=0x55c6b3323500 secure :-1 s=STATE_CONNECTION_ESTABLISHED
l=0).read_until read failed
2021-07-25 16:28:41.814 7fcc62bdb700  1 --2-
[v2:10.152.28.171:3300/0,v1:10.152.28.171:6789/0] >>
[v2:10.152.28.172:3300/0,v1:10.152.28.172:6789/0] conn(0x55c6b2611180
0x55c6b3323500 secure :-1 s=SESSION_CONNECTING pgs=0 cs=0 l=0 rev1=1
rx=0x55c6b34bbad0 tx=0x55c6b3528130).handle_read_frame_preamble_main
read frame preamble failed r=-1 ((1) Operation not permitted)
2021-07-25 16:28:41.814 7fcc5dbd1700  1 --
[v2:10.152.28.171:3300/0,v1:10.152.28.171:6789/0] >>
[v2:10.152.28.173:3300/0,v1:10.152.28.173:6789/0] conn(0x55c6b2611600
msgr2=0x55c6b3323c00 secure :-1 s=STATE_CONNECTION_ESTABLISHED
l=0).read_bulk peer close file descriptor 28
2021-07-25 16:28:41.814 7fcc5dbd1700  1 --
[v2:10.152.28.171:3300/0,v1:10.152.28.171:6789/0] >>
[v2:10.152.28.173:3300/0,v1:10.152.28.173:6789/0] conn(0x55c6b2611600
msgr2=0x55c6b3323c00 secure :-1 s=STATE_CONNECTION_ESTABLISHED
l=0).read_until read failed
2021-07-25 16:28:41.814 7fcc5dbd1700  1 --2-
[v2:10.152.28.171:3300/0,v1:10.152.28.171:6789/0] >>
[v2:10.152.28.173:3300/0,v1:10.152.28.173:6789/0] conn(0x55c6b2611600
0x55c6b3323c00 secure :-1 s=SESSION_CONNECTING pgs=0 cs=0 l=0 rev1=1
rx=0x55c6b3553830 tx=0x55c6b34809a0).handle_read_frame_preamble_main
read frame preamble failed r=-1 ((1) Operation not permitted)
2021-07-25 16:28:41.814 7fcc62bdb700  1 --2-
[v2:10.152.28.171:3300/0,v1:10.152.28.171:6789/0] >>
[v2:10.152.28.172:3300/0,v1:10.152.28.172:6789/0] conn(0x55c6b2611180
0x55c6b3323500 unknown :-1 s=START_CONNECT pgs=0 cs=0 l=0 rev1=1 rx=0
tx=0)._fault waiting 15.000000
2021-07-25 16:28:41.814 7fcc5dbd1700  1 --2-
[v2:10.152.28.171:3300/0,v1:10.152.28.171:6789/0] >>
[v2:10.152.28.173:3300/0,v1:10.152.28.173:6789/0] conn(0x55c6b2611600
0x55c6b3323c00 unknown :-1 s=START_CONNECT pgs=0 cs=0 l=0 rev1=1 rx=0
tx=0)._fault waiting 15.000000
2021-07-25 16:28:42.934 7fcc5dbd1700  1 --2-
[v2:10.152.28.171:3300/0,v1:10.152.28.171:6789/0] >>
conn(0x55c6b35a0d00 0x55c6b3325100 unknown :-1 s=NONE pgs=0 cs=0 l=0
rev1=0 rx=0 tx=0).accept
2021-07-25 16:28:42.934 7fcc633dc700  1 --2-
[v2:10.152.28.171:3300/0,v1:10.152.28.171:6789/0] >>
conn(0x55c6b35a0d00 0x55c6b3325100 unknown :-1 s=BANNER_ACCEPTING
pgs=0 cs=0 l=0 rev1=0 rx=0 tx=0)._handle_peer_banner_payload
supported=1 required=0
2021-07-25 16:28:42.934 7fcc62bdb700  1 --1-
[v2:10.152.28.171:3300/0,v1:10.152.28.171:6789/0] >>
conn(0x55c6b355ba80 0x55c6b3514800 :6789 s=ACCEPTING pgs=0 cs=0
l=0).send_server_banner sd=28 legacy v1:10.152.28.171:6789/0
socket_addr v1:10.152.28.171:6789/0 target_addr
v1:10.152.28.172:50976/0
2021-07-25 16:28:42.934 7fcc5ebd3700 10 mon.osd01@0(probing) e1
ms_handle_accept con 0x55c6b355ba80 no session
2021-07-25 16:28:42.934 7fcc633dc700 10 mon.osd01@0(probing) e1
handle_auth_request con 0x55c6b35a0d00 (start) method 2 payload 22
2021-07-25 16:28:42.934 7fcc633dc700 10 mon.osd01@0(probing) e1
handle_auth_request haven't formed initial quorum, EBUSY
2021-07-25 16:28:42.934 7fcc633dc700  1 --2-
[v2:10.152.28.171:3300/0,v1:10.152.28.171:6789/0] >>
conn(0x55c6b35a0d00 0x55c6b3325100 secure :-1 s=AUTH_ACCEPTING pgs=0
cs=0 l=1 rev1=1 rx=0 tx=0).stop
2021-07-25 16:28:42.934 7fcc5ebd3700 10 mon.osd01@0(probing) e1
ms_handle_reset 0x55c6b35a0d00 -
2021-07-25 16:28:42.934 7fcc5ebd3700  1 --
[v2:10.152.28.171:3300/0,v1:10.152.28.171:6789/0] <== client.?
v1:10.152.28.172:0/3094543445 1 ==== auth(proto 0 34 bytes epoch 0) v1
==== 64+0+0 (unknown 4015746775 0 0) 0x55c6b351d840 con 0x55c6b355ba80
2021-07-25 16:28:42.934 7fcc62bdb700  1 --
[v2:10.152.28.171:3300/0,v1:10.152.28.171:6789/0] >>
v1:10.152.28.172:0/3094543445 conn(0x55c6b355ba80
legacy=0x55c6b3514800 unknown :6789 s=STATE_CONNECTION_ESTABLISHED
l=1).read_bulk peer close file descriptor 28
2021-07-25 16:28:42.934 7fcc62bdb700  1 --
[v2:10.152.28.171:3300/0,v1:10.152.28.171:6789/0] >>
v1:10.152.28.172:0/3094543445 conn(0x55c6b355ba80
legacy=0x55c6b3514800 unknown :6789 s=STATE_CONNECTION_ESTABLISHED
l=1).read_until read failed
2021-07-25 16:28:42.934 7fcc5ebd3700 10 mon.osd01@0(probing) e1
_ms_dispatch new session 0x55c6b351c880 MonSession(client.?
v1:10.152.28.172:0/3094543445 is open , features 0x3ffddff8ffecffff
(luminous)) features 0x3ffddff8ffecffff
2021-07-25 16:28:42.934 7fcc5ebd3700 20 mon.osd01@0(probing) e1
entity_name  global_id 0 (none) caps
2021-07-25 16:28:42.934 7fcc62bdb700  1 --1-
[v2:10.152.28.171:3300/0,v1:10.152.28.171:6789/0] >>
v1:10.152.28.172:0/3094543445 conn(0x55c6b355ba80 0x55c6b3514800 :6789
s=OPENED pgs=1 cs=1 l=1).handle_message read tag failed
2021-07-25 16:28:42.934 7fcc5ebd3700  5 mon.osd01@0(probing) e1
waitlisting message auth(proto 0 34 bytes epoch 0) v1
2021-07-25 16:28:42.934 7fcc62bdb700  1 --1-
[v2:10.152.28.171:3300/0,v1:10.152.28.171:6789/0] >>
v1:10.152.28.172:0/3094543445 conn(0x55c6b355ba80 0x55c6b3514800 :6789
s=OPENED pgs=1 cs=1 l=1).fault on lossy channel, failing
2021-07-25 16:28:42.934 7fcc5ebd3700 10 mon.osd01@0(probing) e1
ms_handle_reset 0x55c6b355ba80 v1:10.152.28.172:0/3094543445
2021-07-25 16:28:42.934 7fcc5ebd3700 10 mon.osd01@0(probing) e1
reset/close on session client.? v1:10.152.28.172:0/3094543445
2021-07-25 16:28:42.934 7fcc5ebd3700 10 mon.osd01@0(probing) e1
remove_session 0x55c6b351c880 client.? v1:10.152.28.172:0/3094543445
features 0x3ffddff8ffecffff
2021-07-25 16:28:42.938 7fcc5dbd1700  1 --2-
[v2:10.152.28.171:3300/0,v1:10.152.28.171:6789/0] >>
conn(0x55c6b3559b00 0x55c6b34ffc00 unknown :-1 s=NONE pgs=0 cs=0 l=0
rev1=0 rx=0 tx=0).accept
2021-07-25 16:28:42.938 7fcc633dc700  1 --2-
[v2:10.152.28.171:3300/0,v1:10.152.28.171:6789/0] >>
conn(0x55c6b3559b00 0x55c6b34ffc00 unknown :-1 s=BANNER_ACCEPTING
pgs=0 cs=0 l=0 rev1=0 rx=0 tx=0)._handle_peer_banner_payload
supported=1 required=0
2021-07-25 16:28:42.938 7fcc633dc700 10 mon.osd01@0(probing) e1
handle_auth_request con 0x55c6b3559b00 (start) method 2 payload 22
2021-07-25 16:28:42.938 7fcc633dc700 10 mon.osd01@0(probing) e1
handle_auth_request haven't formed initial quorum, EBUSY
2021-07-25 16:28:42.938 7fcc633dc700  1 --2-
[v2:10.152.28.171:3300/0,v1:10.152.28.171:6789/0] >>
conn(0x55c6b3559b00 0x55c6b34ffc00 secure :-1 s=AUTH_ACCEPTING pgs=0
cs=0 l=1 rev1=1 rx=0 tx=0).stop
2021-07-25 16:28:42.938 7fcc5ebd3700 10 mon.osd01@0(probing) e1
ms_handle_reset 0x55c6b3559b00 -
2021-07-25 16:28:43.418 7fcc613d8700  4 mon.osd01@0(probing) e1
probe_timeout 0x55c6b34ba0f0
2021-07-25 16:28:43.418 7fcc613d8700 10 mon.osd01@0(probing) e1 bootstrap
2021-07-25 16:28:43.418 7fcc613d8700 10 mon.osd01@0(probing) e1
sync_reset_requester
2021-07-25 16:28:43.418 7fcc613d8700 10 mon.osd01@0(probing) e1
unregister_cluster_logger - not registered
2021-07-25 16:28:43.418 7fcc613d8700 10 mon.osd01@0(probing) e1
cancel_probe_timeout (none scheduled)
2021-07-25 16:28:43.418 7fcc613d8700 10 mon.osd01@0(probing) e1 monmap
e1: 3 mons at {osd01=[v2:10.152.28.171:3300/0,v1:10.152.28.171:6789/0],osd02=[v2:10.152.28.172:3300/0,v1:10.152.28.172:6789/0],osd03=[v2:10.152.28.173:3300/0,v1:10.152.28.173:6789/0]}
2021-07-25 16:28:43.418 7fcc613d8700 10 mon.osd01@0(probing) e1 _reset
2021-07-25 16:28:43.418 7fcc613d8700 10 mon.osd01@0(probing).auth v0
_set_mon_num_rank num 0 rank 0
2021-07-25 16:28:43.418 7fcc613d8700 10 mon.osd01@0(probing) e1
cancel_probe_timeout (none scheduled)
2021-07-25 16:28:43.418 7fcc613d8700 10 mon.osd01@0(probing) e1 timecheck_finish
2021-07-25 16:28:43.418 7fcc613d8700 15 mon.osd01@0(probing) e1 health_tick_stop
2021-07-25 16:28:43.418 7fcc613d8700 15 mon.osd01@0(probing) e1
health_interval_stop
2021-07-25 16:28:43.418 7fcc613d8700 10 mon.osd01@0(probing) e1
scrub_event_cancel
2021-07-25 16:28:43.418 7fcc613d8700 10 mon.osd01@0(probing) e1 scrub_reset
2021-07-25 16:28:43.418 7fcc613d8700 10 mon.osd01@0(probing) e1
cancel_probe_timeout (none scheduled)
2021-07-25 16:28:43.418 7fcc613d8700 10 mon.osd01@0(probing) e1
reset_probe_timeout 0x55c6b3553260 after 2 seconds


It still looks like a connection issue (the probes to the other mons fail
with "read frame preamble failed r=-1 ((1) Operation not permitted)"), but
I can connect using telnet:

root@osd01:~# telnet 10.152.28.172 6789
Trying 10.152.28.172...
Connected to 10.152.28.172.
Escape character is '^]'.
ceph v027
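
The msgr2 port (3300) can be probed the same way; a quick sketch with
netcat (assuming nc is available on the node):

root@osd01:~# nc -zv 10.152.28.172 3300
root@osd01:~# nc -zv 10.152.28.173 3300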



> .. Dan
>
>
>
>
>
> On Sun, 25 Jul 2021, 17:53 Ansgar Jazdzewski, <a.jazdzewski@xxxxxxxxxxxxxx> wrote:
>>
>> On Sun, 25 July 2021 at 17:17, Dan van der Ster
>> <dan@xxxxxxxxxxxxxx> wrote:
>> >
>> > > raise the min version to nautilus
>> >
>> > Are you referring to the min osd version or the min client version?
>>
>> Yes, sorry, that was not written clearly.
>>
>> > I don't think the latter will help.
>> >
>> > Are you sure that mon.osd01 can reach those other mons on ports 6789 and 3300?
>>
>> Yes, I just tested it one more time: ping (with the full MTU) and telnet to all mon ports.
>>
>> > Do you have any notable custom ceph configurations on this cluster?
>>
>> No, nothing fancy that I can think of:
>>
>> [global]
>> cluster network = 10.152.40.0/22
>> fsid = a6baa789-6be2-4ce0-ab2d-7c78b899d4bd
>> mon host = 10.152.28.171,10.152.28.172,10.152.28.173
>> mon initial members = osd01,osd02,osd03
>> osd pool default crush rule = -1
>> public network = 10.152.28.0/22
>>
>>
>> I just tried to start the mon with --force-sync, but since the mon does
>> not join, it will not pull any data:
>> ceph-mon -f --cluster ceph --id osd01 --setuser ceph --setgroup ceph
>> --debug_mon 10 --yes-i-really-mean-it --force-sync -d
>>
>> I can try to move the /var/lib/ceph/mon/ dir and recreate it!?
>>
>>
>> thanks for all the help so far!
>> Ansgar
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx


