Re: 1/3 mons down! mon does not rejoin

> raise the min version to nautilus

Are you referring to the min osd version or the min client version?

I don't think the latter will help.
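
If you mean the former: before touching anything I'd just look at what the
cluster currently reports, e.g. (read-only, nothing here changes state):

    ceph versions
    ceph osd dump | grep require

Raising require-osd-release / the min compat client is something I'd only
consider once all three mons are healthy again.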

Are you sure that mon.osd01 can reach those other mons on ports 6789 and
3300?
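
A quick way to check from osd01 (IPs taken from your monmap) would be
something like:

    nc -zv 10.152.28.172 6789
    nc -zv 10.152.28.172 3300
    nc -zv 10.152.28.173 6789
    nc -zv 10.152.28.173 3300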

Do you have any notable custom ceph configurations on this cluster?
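
e.g. anything unusual in:

    ceph config dump
    grep -v '^#' /etc/ceph/ceph.conf   # on the mon hosts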

.. Dan




On Sun, 25 Jul 2021, 17:04 Ansgar Jazdzewski, <a.jazdzewski@xxxxxxxxxxxxxx>
wrote:

> hi Dan, hi Folks,
>
> I started the mon on osd01 in the foreground with debugging and basically
> got this loop! Maybe it would help to raise the min version to nautilus,
> but I'm afraid to run those commands on a cluster in the current state.
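> (Roughly what I ran; the exact debug levels are from memory:)
>
>     systemctl stop ceph-mon@osd01
>     ceph-mon -d -i osd01 --debug-mon 20 --debug-ms 1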
>
> mon.osd01@0(probing).auth v0 _set_mon_num_rank num 0 rank 0
> mon.osd01@0(probing) e1 cancel_probe_timeout (none scheduled)
> mon.osd01@0(probing) e1 timecheck_finish
> mon.osd01@0(probing) e1 scrub_event_cancel
> mon.osd01@0(probing) e1 scrub_reset
> mon.osd01@0(probing) e1 cancel_probe_timeout (none scheduled)
> mon.osd01@0(probing) e1 reset_probe_timeout 0x560954c9c420 after 2 seconds
> mon.osd01@0(probing) e1 probing other monitors
> mon.osd01@0(probing) e1 handle_auth_request con 0x560954cfcd80 (start)
> method 1 payload 22
> mon.osd01@0(probing) e1 handle_auth_request haven't formed initial
> quorum, EBUSY
> mon.osd01@0(probing) e1 ms_handle_reset 0x560954cfcd80 -
> mon.osd01@0(probing) e1 handle_auth_request con 0x560954cbe880 (start)
> method 1 payload 22
> mon.osd01@0(probing) e1 handle_auth_request haven't formed initial
> quorum, EBUSY
> mon.osd01@0(probing) e1 ms_handle_reset 0x560954cbe880 -
> mon.osd01@0(probing) e1 ms_handle_accept con 0x560954cff600 no session
> mon.osd01@0(probing) e1 handle_auth_request con 0x560954cffa80 (start)
> method 2 payload 22
> mon.osd01@0(probing) e1 handle_auth_request haven't formed initial
> quorum, EBUSY
> mon.osd01@0(probing) e1 ms_handle_reset 0x560954cffa80 -
> mon.osd01@0(probing) e1 _ms_dispatch new session 0x560954d5a000
> MonSession(client.? v1:10.152.28.172:0/191360419 is open , features
> 0x3ffddff8ffecffff (luminous)) features 0x3ffddff8ffecffff
> mon.osd01@0(probing) e1 waitlisting message auth(proto 0 34 bytes epoch
> 0) v1
> mon.osd01@0(probing) e1 ms_handle_reset 0x560954cff600
> v1:10.152.28.172:0/191360419
> mon.osd01@0(probing) e1 reset/close on session client.?
> v1:10.152.28.172:0/191360419
> mon.osd01@0(probing) e1 remove_session 0x560954d5a000 client.?
> v1:10.152.28.172:0/191360419 features 0x3ffddff8ffecffff
> mon.osd01@0(probing) e1 handle_auth_request con 0x560954d36000 (start)
> method 2 payload 22
> mon.osd01@0(probing) e1 handle_auth_request haven't formed initial
> quorum, EBUSY
> mon.osd01@0(probing) e1 ms_handle_reset 0x560954d36000 -
> mon.osd01@0(probing) e1 handle_auth_request con 0x560954cffa80 (start)
> method 2 payload 22
> mon.osd01@0(probing) e1 handle_auth_request haven't formed initial
> quorum, EBUSY
> mon.osd01@0(probing) e1 ms_handle_reset 0x560954cffa80 -
> mon.osd01@0(probing) e1  trimming session 0x560954d36900 client.?
> because we've been out of quorum too long
> mon.osd01@0(probing) e1 remove_session 0x560954d5b440 client.?
> v1:10.152.28.177:0/1121994743 features 0x3ffddff8ffecffff
> mon.osd01@0(probing) e1  session closed, dropping 0x560954ce5680
> mon.osd01@0(probing) e1  session closed, dropping 0x560954ce5f80
> mon.osd01@0(probing) e1  session closed, dropping 0x560954d106c0
> mon.osd01@0(probing) e1  session closed, dropping 0x560954d11680
> mon.osd01@0(probing) e1  session closed, dropping 0x560954d11200
> mon.osd01@0(probing) e1  session closed, dropping 0x560954ce4b40
> mon.osd01@0(probing) e1  session closed, dropping 0x560954d5b8c0
> mon.osd01@0(probing) e1 handle_auth_request con 0x560954cfd680 (start)
> method 2 payload 23
> mon.osd01@0(probing) e1 handle_auth_request haven't formed initial
> quorum, EBUSY
> mon.osd01@0(probing) e1 ms_handle_reset 0x560954cfd680 -
> mon.osd01@0(probing) e1 ms_handle_accept con 0x560954cfcd80 no session
> mon.osd01@0(probing) e1 ms_handle_reset 0x560954cfcd80
> v1:10.152.28.94:0/2655289329
> mon.osd01@0(probing) e1 handle_auth_request con 0x560954d36000 (start)
> method 2 payload 23
> mon.osd01@0(probing) e1 handle_auth_request haven't formed initial
> quorum, EBUSY
> mon.osd01@0(probing) e1 ms_handle_reset 0x560954d36000 -
> mon.osd01@0(probing) e1 ms_handle_accept con 0x560954cffa80 no session
> mon.osd01@0(probing) e1 _ms_dispatch new session 0x560954d5bf80
> MonSession(client.? v1:10.152.28.179:0/1407444163 is open , features
> 0x3ffddff8ffecffff (luminous)) features 0x3ffddff8ffecffff
> mon.osd01@0(probing) e1 waitlisting message auth(proto 0 30 bytes epoch
> 0) v1
> mon.osd01@0(probing) e1 handle_auth_request con 0x560954cbe880 (start)
> method 1 payload 41
> mon.osd01@0(probing) e1 handle_auth_request haven't formed initial
> quorum, EBUSY
> mon.osd01@0(probing) e1 ms_handle_reset 0x560954cbe880 -
> mon.osd01@0(probing) e1 ms_get_authorizer for mon
> mon.osd01@0(probing) e1 ms_get_authorizer for mon
> mon.osd01@0(probing) e1 handle_auth_request con 0x560954cbdf80 (start)
> method 1 payload 41
> mon.osd01@0(probing) e1 handle_auth_request haven't formed initial
> quorum, EBUSY
> mon.osd01@0(probing) e1 ms_handle_reset 0x560954cbdf80 -
> mon.osd01@0(probing) e1 handle_auth_request con 0x560954cbd200 (start)
> method 2 payload 23
> mon.osd01@0(probing) e1 handle_auth_request haven't formed initial
> quorum, EBUSY
> mon.osd01@0(probing) e1 ms_handle_reset 0x560954cbd200 -
> mon.osd01@0(probing) e1 handle_auth_request con 0x560954d36000 (start)
> method 1 payload 41
> mon.osd01@0(probing) e1 handle_auth_request haven't formed initial
> quorum, EBUSY
> mon.osd01@0(probing) e1 ms_handle_reset 0x560954d36000 -
> mon.osd01@0(probing) e1 probe_timeout 0x560954c9c420
> mon.osd01@0(probing) e1 bootstrap
> mon.osd01@0(probing) e1 sync_reset_requester
> mon.osd01@0(probing) e1 unregister_cluster_logger - not registered
> mon.osd01@0(probing) e1 cancel_probe_timeout (none scheduled)
> mon.osd01@0(probing) e1 monmap e1: 3 mons at
> {osd01=[v2:10.152.28.171:3300/0,v1:10.152.28.171:6789/0],osd02=[v2:
> 10.152.28.172:3300/0,v1:10.152.28.172:6789/0],osd03=[v2:
> 10.152.28.173:3300/0,v1:10.152.28.173:6789/0]}
> mon.osd01@0(probing) e1 _reset
>
> On Sun, 25 Jul 2021 at 13:24, Dan van der Ster
> <dan@xxxxxxxxxxxxxx> wrote:
> >
> > With four mons total, only one can be down... mon.osd01 is already down,
> so you're at the limit.
> >
> > It's possible that whichever reason is preventing this mon from joining
> will also prevent the new mon from joining.
> >
> > I think you should:
> >
> > 1. Investigate why mon.osd01 isn't coming back into the quorum... The
> > logs on that mon or the others can help.
> > 2. If you decide to give up on mon.osd01, then first you should remove
> > it from the cluster before you add a mon from another host (rough sketch
> > below).
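> >
> > For 2, something like this (double-check the mon name first):
> >
> >     ceph mon dump          # confirm osd01 is still in the monmap
> >     ceph mon remove osd01  # drop it from the monmap
> >
> > and only then create/start the new mon on osd04.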
> >
> > .. Dan
> >
> >
> > On Sun, 25 Jul 2021, 12:43 Ansgar Jazdzewski, <
> a.jazdzewski@xxxxxxxxxxxxxx> wrote:
> >>
> >> hi folks
> >>
> >> I have a cluster running ceph 14.2.22 on Ubuntu 18.04. Some hours
> >> ago one of the mons stopped working and the on-call team rebooted the
> >> node; now the mon is not joining the ceph cluster.
> >>
> >> TCP ports of mons are open and reachable!
> >>
> >> ceph health detail
> >> HEALTH_WARN 1/3 mons down, quorum osd02,osd03
> >> MON_DOWN 1/3 mons down, quorum osd02,osd03
> >>     mon.osd01 (rank 0) addr
> >> [v2:10.152.28.171:3300/0,v1:10.152.28.171:6789/0] is down (out of
> >> quorum)
> >>
> >> I'd like to add a new third mon to the cluster on osd04, but I'm a bit
> >> scared, as it could leave 50% of the mons unreachable!?
> >>
> >> Question: should I remove the mon on osd01 first and recreate the
> >> daemon before starting a new mon on osd04?
> >>
> >>
> >> Thanks for your input!
> >> Ansgar
> >> _______________________________________________
> >> ceph-users mailing list -- ceph-users@xxxxxxx
> >> To unsubscribe send an email to ceph-users-leave@xxxxxxx
>
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx


