Hi,

Do you have ceph-mon logs from when mon.osd01 first failed, before the on-call team rebooted it? They might give a clue about what happened to start this problem, which may still be happening now.

This looks similar, but it was eventually found to be a network issue: https://tracker.ceph.com/issues/48033

-- Dan
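A note on where those pre-reboot mon logs would normally be found: the sketch below assumes a package-based (non-containerized) Nautilus install with default logging, so the log path, rotated-file name and systemd unit name are the usual defaults, not anything confirmed for this cluster.

  # on-disk mon log, including the rotated copy covering the time before the reboot
  less /var/log/ceph/ceph-mon.osd01.log
  zless /var/log/ceph/ceph-mon.osd01.log.1.gz
  # or, if the daemon was logging to the journal instead (unit name assumed)
  journalctl -u ceph-mon@osd01 --since "2021-07-24" --until "2021-07-25 17:00"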
On Sun, Jul 25, 2021 at 6:36 PM Ansgar Jazdzewski <a.jazdzewski@xxxxxxxxxxxxxx> wrote:
>
> On Sun, Jul 25, 2021 at 6:02 PM Dan van der Ster <dan@xxxxxxxxxxxxxx> wrote:
> >
> > What do you have for the new global_id settings? Maybe set it to allow insecure global_id auth and see if that allows the mon to join?
>
> auth_allow_insecure_global_id_reclaim is still allowed, as we still have some VMs that have not been restarted.
>
> # ceph config get mon.*
> WHO  MASK  LEVEL     OPTION                                           VALUE  RO
> mon        advanced  auth_allow_insecure_global_id_reclaim           true
> mon        advanced  mon_warn_on_insecure_global_id_reclaim          false
> mon        advanced  mon_warn_on_insecure_global_id_reclaim_allowed  false
>
> > > I can try to move the /var/lib/ceph/mon/ dir and recreate it!?
> >
> > I'm not sure it will help. Running the mon with --debug_ms=1 might give clues why it's stuck probing.
>
> 2021-07-25 16:28:41.418 7fcc613d8700 10 mon.osd01@0(probing) e1 probing other monitors
> 2021-07-25 16:28:41.418 7fcc613d8700  1 -- [v2:10.152.28.171:3300/0,v1:10.152.28.171:6789/0] send_to--> mon [v2:10.152.28.172:3300/0,v1:10.152.28.172:6789/0] -- mon_probe(probe a6baa789-6be2-4ce0-ab2d-7c78b899d4bd name osd01 mon_release 14) v7 -- ?+0 0x55c6b35ae780
> 2021-07-25 16:28:41.418 7fcc613d8700  1 -- [v2:10.152.28.171:3300/0,v1:10.152.28.171:6789/0] --> [v2:10.152.28.172:3300/0,v1:10.152.28.172:6789/0] -- mon_probe(probe a6baa789-6be2-4ce0-ab2d-7c78b899d4bd name osd01 mon_release 14) v7 -- 0x55c6b35ae780 con 0x55c6b2611180
> 2021-07-25 16:28:41.418 7fcc613d8700  1 -- [v2:10.152.28.171:3300/0,v1:10.152.28.171:6789/0] send_to--> mon [v2:10.152.28.173:3300/0,v1:10.152.28.173:6789/0] -- mon_probe(probe a6baa789-6be2-4ce0-ab2d-7c78b899d4bd name osd01 mon_release 14) v7 -- ?+0 0x55c6b35aea00
> 2021-07-25 16:28:41.418 7fcc613d8700  1 -- [v2:10.152.28.171:3300/0,v1:10.152.28.171:6789/0] --> [v2:10.152.28.173:3300/0,v1:10.152.28.173:6789/0] -- mon_probe(probe a6baa789-6be2-4ce0-ab2d-7c78b899d4bd name osd01 mon_release 14) v7 -- 0x55c6b35aea00 con 0x55c6b2611600
> 2021-07-25 16:28:41.814 7fcc5dbd1700  1 --2- [v2:10.152.28.171:3300/0,v1:10.152.28.171:6789/0] >> [v2:10.152.28.173:3300/0,v1:10.152.28.173:6789/0] conn(0x55c6b2611600 0x55c6b3323c00 unknown :-1 s=BANNER_CONNECTING pgs=0 cs=0 l=0 rev1=1 rx=0 tx=0)._handle_peer_banner_payload supported=1 required=0
> 2021-07-25 16:28:41.814 7fcc62bdb700  1 --2- [v2:10.152.28.171:3300/0,v1:10.152.28.171:6789/0] >> [v2:10.152.28.172:3300/0,v1:10.152.28.172:6789/0] conn(0x55c6b2611180 0x55c6b3323500 unknown :-1 s=BANNER_CONNECTING pgs=0 cs=0 l=0 rev1=1 rx=0 tx=0)._handle_peer_banner_payload supported=1 required=0
> 2021-07-25 16:28:41.814 7fcc62bdb700 10 mon.osd01@0(probing) e1 ms_get_authorizer for mon
> 2021-07-25 16:28:41.814 7fcc5dbd1700 10 mon.osd01@0(probing) e1 ms_get_authorizer for mon
> 2021-07-25 16:28:41.814 7fcc62bdb700  1 -- [v2:10.152.28.171:3300/0,v1:10.152.28.171:6789/0] >> [v2:10.152.28.172:3300/0,v1:10.152.28.172:6789/0] conn(0x55c6b2611180 msgr2=0x55c6b3323500 secure :-1 s=STATE_CONNECTION_ESTABLISHED l=0).read_bulk peer close file descriptor 27
> 2021-07-25 16:28:41.814 7fcc62bdb700  1 -- [v2:10.152.28.171:3300/0,v1:10.152.28.171:6789/0] >> [v2:10.152.28.172:3300/0,v1:10.152.28.172:6789/0] conn(0x55c6b2611180 msgr2=0x55c6b3323500 secure :-1 s=STATE_CONNECTION_ESTABLISHED l=0).read_until read failed
> 2021-07-25 16:28:41.814 7fcc62bdb700  1 --2- [v2:10.152.28.171:3300/0,v1:10.152.28.171:6789/0] >> [v2:10.152.28.172:3300/0,v1:10.152.28.172:6789/0] conn(0x55c6b2611180 0x55c6b3323500 secure :-1 s=SESSION_CONNECTING pgs=0 cs=0 l=0 rev1=1 rx=0x55c6b34bbad0 tx=0x55c6b3528130).handle_read_frame_preamble_main read frame preamble failed r=-1 ((1) Operation not permitted)
> 2021-07-25 16:28:41.814 7fcc5dbd1700  1 -- [v2:10.152.28.171:3300/0,v1:10.152.28.171:6789/0] >> [v2:10.152.28.173:3300/0,v1:10.152.28.173:6789/0] conn(0x55c6b2611600 msgr2=0x55c6b3323c00 secure :-1 s=STATE_CONNECTION_ESTABLISHED l=0).read_bulk peer close file descriptor 28
> 2021-07-25 16:28:41.814 7fcc5dbd1700  1 -- [v2:10.152.28.171:3300/0,v1:10.152.28.171:6789/0] >> [v2:10.152.28.173:3300/0,v1:10.152.28.173:6789/0] conn(0x55c6b2611600 msgr2=0x55c6b3323c00 secure :-1 s=STATE_CONNECTION_ESTABLISHED l=0).read_until read failed
> 2021-07-25 16:28:41.814 7fcc5dbd1700  1 --2- [v2:10.152.28.171:3300/0,v1:10.152.28.171:6789/0] >> [v2:10.152.28.173:3300/0,v1:10.152.28.173:6789/0] conn(0x55c6b2611600 0x55c6b3323c00 secure :-1 s=SESSION_CONNECTING pgs=0 cs=0 l=0 rev1=1 rx=0x55c6b3553830 tx=0x55c6b34809a0).handle_read_frame_preamble_main read frame preamble failed r=-1 ((1) Operation not permitted)
> 2021-07-25 16:28:41.814 7fcc62bdb700  1 --2- [v2:10.152.28.171:3300/0,v1:10.152.28.171:6789/0] >> [v2:10.152.28.172:3300/0,v1:10.152.28.172:6789/0] conn(0x55c6b2611180 0x55c6b3323500 unknown :-1 s=START_CONNECT pgs=0 cs=0 l=0 rev1=1 rx=0 tx=0)._fault waiting 15.000000
> 2021-07-25 16:28:41.814 7fcc5dbd1700  1 --2- [v2:10.152.28.171:3300/0,v1:10.152.28.171:6789/0] >> [v2:10.152.28.173:3300/0,v1:10.152.28.173:6789/0] conn(0x55c6b2611600 0x55c6b3323c00 unknown :-1 s=START_CONNECT pgs=0 cs=0 l=0 rev1=1 rx=0 tx=0)._fault waiting 15.000000
> 2021-07-25 16:28:42.934 7fcc5dbd1700  1 --2- [v2:10.152.28.171:3300/0,v1:10.152.28.171:6789/0] >> conn(0x55c6b35a0d00 0x55c6b3325100 unknown :-1 s=NONE pgs=0 cs=0 l=0 rev1=0 rx=0 tx=0).accept
> 2021-07-25 16:28:42.934 7fcc633dc700  1 --2- [v2:10.152.28.171:3300/0,v1:10.152.28.171:6789/0] >> conn(0x55c6b35a0d00 0x55c6b3325100 unknown :-1 s=BANNER_ACCEPTING pgs=0 cs=0 l=0 rev1=0 rx=0 tx=0)._handle_peer_banner_payload supported=1 required=0
> 2021-07-25 16:28:42.934 7fcc62bdb700  1 --1- [v2:10.152.28.171:3300/0,v1:10.152.28.171:6789/0] >> conn(0x55c6b355ba80 0x55c6b3514800 :6789 s=ACCEPTING pgs=0 cs=0 l=0).send_server_banner sd=28 legacy v1:10.152.28.171:6789/0 socket_addr v1:10.152.28.171:6789/0 target_addr v1:10.152.28.172:50976/0
> 2021-07-25 16:28:42.934 7fcc5ebd3700 10 mon.osd01@0(probing) e1 ms_handle_accept con 0x55c6b355ba80 no session
> 2021-07-25 16:28:42.934 7fcc633dc700 10 mon.osd01@0(probing) e1 handle_auth_request con 0x55c6b35a0d00 (start) method 2 payload 22
> 2021-07-25 16:28:42.934 7fcc633dc700 10 mon.osd01@0(probing) e1 handle_auth_request haven't formed initial quorum, EBUSY
> 2021-07-25 16:28:42.934 7fcc633dc700  1 --2- [v2:10.152.28.171:3300/0,v1:10.152.28.171:6789/0] >> conn(0x55c6b35a0d00 0x55c6b3325100 secure :-1 s=AUTH_ACCEPTING pgs=0 cs=0 l=1 rev1=1 rx=0 tx=0).stop
> 2021-07-25 16:28:42.934 7fcc5ebd3700 10 mon.osd01@0(probing) e1 ms_handle_reset 0x55c6b35a0d00 -
> 2021-07-25 16:28:42.934 7fcc5ebd3700  1 -- [v2:10.152.28.171:3300/0,v1:10.152.28.171:6789/0] <== client.? v1:10.152.28.172:0/3094543445 1 ==== auth(proto 0 34 bytes epoch 0) v1 ==== 64+0+0 (unknown 4015746775 0 0) 0x55c6b351d840 con 0x55c6b355ba80
> 2021-07-25 16:28:42.934 7fcc62bdb700  1 -- [v2:10.152.28.171:3300/0,v1:10.152.28.171:6789/0] >> v1:10.152.28.172:0/3094543445 conn(0x55c6b355ba80 legacy=0x55c6b3514800 unknown :6789 s=STATE_CONNECTION_ESTABLISHED l=1).read_bulk peer close file descriptor 28
> 2021-07-25 16:28:42.934 7fcc62bdb700  1 -- [v2:10.152.28.171:3300/0,v1:10.152.28.171:6789/0] >> v1:10.152.28.172:0/3094543445 conn(0x55c6b355ba80 legacy=0x55c6b3514800 unknown :6789 s=STATE_CONNECTION_ESTABLISHED l=1).read_until read failed
> 2021-07-25 16:28:42.934 7fcc5ebd3700 10 mon.osd01@0(probing) e1 _ms_dispatch new session 0x55c6b351c880 MonSession(client.? v1:10.152.28.172:0/3094543445 is open , features 0x3ffddff8ffecffff (luminous)) features 0x3ffddff8ffecffff
> 2021-07-25 16:28:42.934 7fcc5ebd3700 20 mon.osd01@0(probing) e1 entity_name  global_id 0 (none) caps
> 2021-07-25 16:28:42.934 7fcc62bdb700  1 --1- [v2:10.152.28.171:3300/0,v1:10.152.28.171:6789/0] >> v1:10.152.28.172:0/3094543445 conn(0x55c6b355ba80 0x55c6b3514800 :6789 s=OPENED pgs=1 cs=1 l=1).handle_message read tag failed
> 2021-07-25 16:28:42.934 7fcc5ebd3700  5 mon.osd01@0(probing) e1 waitlisting message auth(proto 0 34 bytes epoch 0) v1
> 2021-07-25 16:28:42.934 7fcc62bdb700  1 --1- [v2:10.152.28.171:3300/0,v1:10.152.28.171:6789/0] >> v1:10.152.28.172:0/3094543445 conn(0x55c6b355ba80 0x55c6b3514800 :6789 s=OPENED pgs=1 cs=1 l=1).fault on lossy channel, failing
> 2021-07-25 16:28:42.934 7fcc5ebd3700 10 mon.osd01@0(probing) e1 ms_handle_reset 0x55c6b355ba80 v1:10.152.28.172:0/3094543445
> 2021-07-25 16:28:42.934 7fcc5ebd3700 10 mon.osd01@0(probing) e1 reset/close on session client.? v1:10.152.28.172:0/3094543445
> 2021-07-25 16:28:42.934 7fcc5ebd3700 10 mon.osd01@0(probing) e1 remove_session 0x55c6b351c880 client.? v1:10.152.28.172:0/3094543445 features 0x3ffddff8ffecffff
> 2021-07-25 16:28:42.938 7fcc5dbd1700  1 --2- [v2:10.152.28.171:3300/0,v1:10.152.28.171:6789/0] >> conn(0x55c6b3559b00 0x55c6b34ffc00 unknown :-1 s=NONE pgs=0 cs=0 l=0 rev1=0 rx=0 tx=0).accept
> 2021-07-25 16:28:42.938 7fcc633dc700  1 --2- [v2:10.152.28.171:3300/0,v1:10.152.28.171:6789/0] >> conn(0x55c6b3559b00 0x55c6b34ffc00 unknown :-1 s=BANNER_ACCEPTING pgs=0 cs=0 l=0 rev1=0 rx=0 tx=0)._handle_peer_banner_payload supported=1 required=0
> 2021-07-25 16:28:42.938 7fcc633dc700 10 mon.osd01@0(probing) e1 handle_auth_request con 0x55c6b3559b00 (start) method 2 payload 22
> 2021-07-25 16:28:42.938 7fcc633dc700 10 mon.osd01@0(probing) e1 handle_auth_request haven't formed initial quorum, EBUSY
> 2021-07-25 16:28:42.938 7fcc633dc700  1 --2- [v2:10.152.28.171:3300/0,v1:10.152.28.171:6789/0] >> conn(0x55c6b3559b00 0x55c6b34ffc00 secure :-1 s=AUTH_ACCEPTING pgs=0 cs=0 l=1 rev1=1 rx=0 tx=0).stop
> 2021-07-25 16:28:42.938 7fcc5ebd3700 10 mon.osd01@0(probing) e1 ms_handle_reset 0x55c6b3559b00 -
> 2021-07-25 16:28:43.418 7fcc613d8700  4 mon.osd01@0(probing) e1 probe_timeout 0x55c6b34ba0f0
> 2021-07-25 16:28:43.418 7fcc613d8700 10 mon.osd01@0(probing) e1 bootstrap
> 2021-07-25 16:28:43.418 7fcc613d8700 10 mon.osd01@0(probing) e1 sync_reset_requester
> 2021-07-25 16:28:43.418 7fcc613d8700 10 mon.osd01@0(probing) e1 unregister_cluster_logger - not registered
> 2021-07-25 16:28:43.418 7fcc613d8700 10 mon.osd01@0(probing) e1 cancel_probe_timeout (none scheduled)
> 2021-07-25 16:28:43.418 7fcc613d8700 10 mon.osd01@0(probing) e1 monmap e1: 3 mons at {osd01=[v2:10.152.28.171:3300/0,v1:10.152.28.171:6789/0],osd02=[v2:10.152.28.172:3300/0,v1:10.152.28.172:6789/0],osd03=[v2:10.152.28.173:3300/0,v1:10.152.28.173:6789/0]}
> 2021-07-25 16:28:43.418 7fcc613d8700 10 mon.osd01@0(probing) e1 _reset
> 2021-07-25 16:28:43.418 7fcc613d8700 10 mon.osd01@0(probing).auth v0 _set_mon_num_rank num 0 rank 0
> 2021-07-25 16:28:43.418 7fcc613d8700 10 mon.osd01@0(probing) e1 cancel_probe_timeout (none scheduled)
> 2021-07-25 16:28:43.418 7fcc613d8700 10 mon.osd01@0(probing) e1 timecheck_finish
> 2021-07-25 16:28:43.418 7fcc613d8700 15 mon.osd01@0(probing) e1 health_tick_stop
> 2021-07-25 16:28:43.418 7fcc613d8700 15 mon.osd01@0(probing) e1 health_interval_stop
> 2021-07-25 16:28:43.418 7fcc613d8700 10 mon.osd01@0(probing) e1 scrub_event_cancel
> 2021-07-25 16:28:43.418 7fcc613d8700 10 mon.osd01@0(probing) e1 scrub_reset
> 2021-07-25 16:28:43.418 7fcc613d8700 10 mon.osd01@0(probing) e1 cancel_probe_timeout (none scheduled)
> 2021-07-25 16:28:43.418 7fcc613d8700 10 mon.osd01@0(probing) e1 reset_probe_timeout 0x55c6b3553260 after 2 seconds
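The lines in the trace above that stand out: the outgoing msgr2 connections to osd02 and osd03 get through the banner exchange, but the peers then close the socket during the secure handshake ("handle_read_frame_preamble_main read frame preamble failed r=-1 ((1) Operation not permitted)"), while incoming requests are turned away with "handle_auth_request haven't formed initial quorum, EBUSY". A minimal sketch of how one might capture that handshake in more detail, mirroring the ceph-mon invocation used later in this thread; the exact debug levels here are only a guess:

  # run in the foreground for a short capture, then stop with Ctrl-C
  ceph-mon -f --cluster ceph --id osd01 --setuser ceph --setgroup ceph \
      --debug_mon 20 --debug_ms 5 --debug_auth 20 -d 2>&1 | tee /tmp/mon.osd01.debug.log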
>
> It still looks like a connection issue, but I can connect using telnet!
>
> root@osd01:~# telnet 10.152.28.172 6789
> Trying 10.152.28.172...
> Connected to 10.152.28.172.
> Escape character is '^]'.
> ceph v027
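The telnet test above only exercises the v1 port (6789), while the handshakes failing in the trace appear to be on the msgr2 side (port 3300). A small sketch of the extra checks one might run from osd01; the 8972-byte payload assumes a 9000 MTU, use 1472 for a standard 1500 MTU:

  # msgr2 port on the other mons
  telnet 10.152.28.172 3300
  telnet 10.152.28.173 3300
  # full-size, don't-fragment pings to catch an MTU mismatch on the path
  ping -c 3 -M do -s 8972 10.152.28.172
  ping -c 3 -M do -s 8972 10.152.28.173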
>
> > .. Dan
> >
> > On Sun, 25 Jul 2021, 17:53 Ansgar Jazdzewski, <a.jazdzewski@xxxxxxxxxxxxxx> wrote:
> >>
> >> On Sun, Jul 25, 2021 at 5:17 PM Dan van der Ster <dan@xxxxxxxxxxxxxx> wrote:
> >> >
> >> > > raise the min version to nautilus
> >> >
> >> > Are you referring to the min osd version or the min client version?
> >>
> >> Yes, sorry, that was not written clearly.
> >>
> >> > I don't think the latter will help.
> >> >
> >> > Are you sure that mon.osd01 can reach those other mons on ports 6789 and 3300?
> >>
> >> Yes, I just tested it one more time: ping (with full MTU) and telnet to all mon ports.
> >>
> >> > Do you have any notable custom ceph configurations on this cluster?
> >>
> >> No, I don't think there is anything fancy:
> >>
> >> [global]
> >> cluster network = 10.152.40.0/22
> >> fsid = a6baa789-6be2-4ce0-ab2d-7c78b899d4bd
> >> mon host = 10.152.28.171,10.152.28.172,10.152.28.173
> >> mon initial members = osd01,osd02,osd03
> >> osd pool default crush rule = -1
> >> public network = 10.152.28.0/22
> >>
> >> I just tried to start the mon with --force-sync, but as the mon did not join it will not pull any data:
> >> ceph-mon -f --cluster ceph --id osd01 --setuser ceph --setgroup ceph --debug_mon 10 --yes-i-really-mean-it --force-sync -d
> >>
> >> I can try to move the /var/lib/ceph/mon/ dir and recreate it!?
> >>
> >> Thanks for all the help so far!
> >> Ansgar

_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx
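On the question above about moving /var/lib/ceph/mon/ and recreating it: below is only a sketch of the documented manual procedure for rebuilding a single mon's store from the surviving quorum (osd02/osd03). It assumes a package-based install with the default data path and systemd unit name, and, as noted above, it may well not help if the underlying problem is elsewhere; double-check against the Nautilus docs before running anything.

  systemctl stop ceph-mon@osd01                    # unit name assumed
  mv /var/lib/ceph/mon/ceph-osd01 /var/lib/ceph/mon/ceph-osd01.bak
  ceph mon getmap -o /tmp/monmap                   # current monmap from the surviving quorum
  ceph auth get mon. -o /tmp/mon.keyring           # mon. key
  ceph-mon -i osd01 --mkfs --monmap /tmp/monmap --keyring /tmp/mon.keyring
  chown -R ceph:ceph /var/lib/ceph/mon/ceph-osd01
  systemctl start ceph-mon@osd01                   # the rebuilt mon should then sync from its peers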