Re: 1/3 mons down! mon does not rejoin

Hi Dan, hi folks,

I started the mon on osd01 in the foreground with debugging and it is
basically stuck in the loop below. Maybe raising the minimum mon release
to nautilus could help, but I'm afraid to run those commands on a
cluster in its current state.
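
For reference, I started the daemon roughly like this (exact flags from
memory; adjust the ID and user to your setup):

ceph-mon -f --id osd01 --setuser ceph --setgroup ceph --debug_mon 10 --debug_ms 1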

mon.osd01@0(probing).auth v0 _set_mon_num_rank num 0 rank 0
mon.osd01@0(probing) e1 cancel_probe_timeout (none scheduled)
mon.osd01@0(probing) e1 timecheck_finish
mon.osd01@0(probing) e1 scrub_event_cancel
mon.osd01@0(probing) e1 scrub_reset
mon.osd01@0(probing) e1 cancel_probe_timeout (none scheduled)
mon.osd01@0(probing) e1 reset_probe_timeout 0x560954c9c420 after 2 seconds
mon.osd01@0(probing) e1 probing other monitors
mon.osd01@0(probing) e1 handle_auth_request con 0x560954cfcd80 (start) method 1 payload 22
mon.osd01@0(probing) e1 handle_auth_request haven't formed initial quorum, EBUSY
mon.osd01@0(probing) e1 ms_handle_reset 0x560954cfcd80 -
mon.osd01@0(probing) e1 handle_auth_request con 0x560954cbe880 (start) method 1 payload 22
mon.osd01@0(probing) e1 handle_auth_request haven't formed initial quorum, EBUSY
mon.osd01@0(probing) e1 ms_handle_reset 0x560954cbe880 -
mon.osd01@0(probing) e1 ms_handle_accept con 0x560954cff600 no session
mon.osd01@0(probing) e1 handle_auth_request con 0x560954cffa80 (start) method 2 payload 22
mon.osd01@0(probing) e1 handle_auth_request haven't formed initial quorum, EBUSY
mon.osd01@0(probing) e1 ms_handle_reset 0x560954cffa80 -
mon.osd01@0(probing) e1 _ms_dispatch new session 0x560954d5a000 MonSession(client.? v1:10.152.28.172:0/191360419 is open , features 0x3ffddff8ffecffff (luminous)) features 0x3ffddff8ffecffff
mon.osd01@0(probing) e1 waitlisting message auth(proto 0 34 bytes epoch 0) v1
mon.osd01@0(probing) e1 ms_handle_reset 0x560954cff600 v1:10.152.28.172:0/191360419
mon.osd01@0(probing) e1 reset/close on session client.? v1:10.152.28.172:0/191360419
mon.osd01@0(probing) e1 remove_session 0x560954d5a000 client.? v1:10.152.28.172:0/191360419 features 0x3ffddff8ffecffff
mon.osd01@0(probing) e1 handle_auth_request con 0x560954d36000 (start) method 2 payload 22
mon.osd01@0(probing) e1 handle_auth_request haven't formed initial quorum, EBUSY
mon.osd01@0(probing) e1 ms_handle_reset 0x560954d36000 -
mon.osd01@0(probing) e1 handle_auth_request con 0x560954cffa80 (start) method 2 payload 22
mon.osd01@0(probing) e1 handle_auth_request haven't formed initial quorum, EBUSY
mon.osd01@0(probing) e1 ms_handle_reset 0x560954cffa80 -
mon.osd01@0(probing) e1  trimming session 0x560954d36900 client.? because we've been out of quorum too long
mon.osd01@0(probing) e1 remove_session 0x560954d5b440 client.? v1:10.152.28.177:0/1121994743 features 0x3ffddff8ffecffff
mon.osd01@0(probing) e1  session closed, dropping 0x560954ce5680
mon.osd01@0(probing) e1  session closed, dropping 0x560954ce5f80
mon.osd01@0(probing) e1  session closed, dropping 0x560954d106c0
mon.osd01@0(probing) e1  session closed, dropping 0x560954d11680
mon.osd01@0(probing) e1  session closed, dropping 0x560954d11200
mon.osd01@0(probing) e1  session closed, dropping 0x560954ce4b40
mon.osd01@0(probing) e1  session closed, dropping 0x560954d5b8c0
mon.osd01@0(probing) e1 handle_auth_request con 0x560954cfd680 (start) method 2 payload 23
mon.osd01@0(probing) e1 handle_auth_request haven't formed initial quorum, EBUSY
mon.osd01@0(probing) e1 ms_handle_reset 0x560954cfd680 -
mon.osd01@0(probing) e1 ms_handle_accept con 0x560954cfcd80 no session
mon.osd01@0(probing) e1 ms_handle_reset 0x560954cfcd80 v1:10.152.28.94:0/2655289329
mon.osd01@0(probing) e1 handle_auth_request con 0x560954d36000 (start) method 2 payload 23
mon.osd01@0(probing) e1 handle_auth_request haven't formed initial quorum, EBUSY
mon.osd01@0(probing) e1 ms_handle_reset 0x560954d36000 -
mon.osd01@0(probing) e1 ms_handle_accept con 0x560954cffa80 no session
mon.osd01@0(probing) e1 _ms_dispatch new session 0x560954d5bf80 MonSession(client.? v1:10.152.28.179:0/1407444163 is open , features 0x3ffddff8ffecffff (luminous)) features 0x3ffddff8ffecffff
mon.osd01@0(probing) e1 waitlisting message auth(proto 0 30 bytes epoch 0) v1
mon.osd01@0(probing) e1 handle_auth_request con 0x560954cbe880 (start) method 1 payload 41
mon.osd01@0(probing) e1 handle_auth_request haven't formed initial quorum, EBUSY
mon.osd01@0(probing) e1 ms_handle_reset 0x560954cbe880 -
mon.osd01@0(probing) e1 ms_get_authorizer for mon
mon.osd01@0(probing) e1 ms_get_authorizer for mon
mon.osd01@0(probing) e1 handle_auth_request con 0x560954cbdf80 (start) method 1 payload 41
mon.osd01@0(probing) e1 handle_auth_request haven't formed initial quorum, EBUSY
mon.osd01@0(probing) e1 ms_handle_reset 0x560954cbdf80 -
mon.osd01@0(probing) e1 handle_auth_request con 0x560954cbd200 (start) method 2 payload 23
mon.osd01@0(probing) e1 handle_auth_request haven't formed initial quorum, EBUSY
mon.osd01@0(probing) e1 ms_handle_reset 0x560954cbd200 -
mon.osd01@0(probing) e1 handle_auth_request con 0x560954d36000 (start) method 1 payload 41
mon.osd01@0(probing) e1 handle_auth_request haven't formed initial quorum, EBUSY
mon.osd01@0(probing) e1 ms_handle_reset 0x560954d36000 -
mon.osd01@0(probing) e1 probe_timeout 0x560954c9c420
mon.osd01@0(probing) e1 bootstrap
mon.osd01@0(probing) e1 sync_reset_requester
mon.osd01@0(probing) e1 unregister_cluster_logger - not registered
mon.osd01@0(probing) e1 cancel_probe_timeout (none scheduled)
mon.osd01@0(probing) e1 monmap e1: 3 mons at {osd01=[v2:10.152.28.171:3300/0,v1:10.152.28.171:6789/0],osd02=[v2:10.152.28.172:3300/0,v1:10.152.28.172:6789/0],osd03=[v2:10.152.28.173:3300/0,v1:10.152.28.173:6789/0]}
mon.osd01@0(probing) e1 _reset

On Sun, 25 Jul 2021 at 13:24, Dan van der Ster
<dan@xxxxxxxxxxxxxx> wrote:
>
> With four mons total, quorum needs three (floor(4/2)+1), so only one can be down... mon.osd01 is already down, so you're at the limit.
>
> It's possible that whatever is preventing this mon from joining will also prevent the new mon from joining.
>
> I think you should:
>
> 1. Investigate why mon.osd01 isn't coming back into the quorum... The logs on that mon or the others can help.
> 2. If you decide to give up on mon.osd01, then you should first remove it from the cluster before you add a mon from another host (roughly as sketched below).
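>
> A minimal sketch, assuming the mon's name is osd01 and you run the ceph commands from a node with an admin keyring:
>
>     systemctl stop ceph-mon@osd01   # on osd01: make sure the stale daemon is stopped
>     ceph mon remove osd01           # drop it from the monmap
>     ceph -s                         # confirm quorum is 2/2 before adding the new mon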
>
> .. Dan
>
>
> On Sun, 25 Jul 2021, 12:43 Ansgar Jazdzewski, <a.jazdzewski@xxxxxxxxxxxxxx> wrote:
>>
>> hi folks
>>
>> I have a cluster running ceph 14.2.22 on ubuntu 18.04. Some hours
>> ago one of the mons stopped working and the on-call team rebooted the
>> node; now the mon is not rejoining the ceph cluster.
>>
>> TCP ports of mons are open and reachable!
>>
>> ceph health detail
>> HEALTH_WARN 1/3 mons down, quorum osd02,osd03
>> MON_DOWN 1/3 mons down, quorum osd02,osd03
>>     mon.osd01 (rank 0) addr
>> [v2:10.152.28.171:3300/0,v1:10.152.28.171:6789/0] is down (out of
>> quorum)
>>
>> I'd like to add a new third mon to the cluster on osd04, but I'm a bit
>> scared, as it could result in 50% of the mons being unreachable!?
>>
>> Question: should I remove the mon on osd01 first and recreate the
>> daemon before starting a new mon on osd04?
>>
>>
>> Thanks for your input!
>> Ansgar
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx


