Re: Ceph Mon not able to authenticate

Hi,

You are not the first to hit this issue.
If you are 146% sure that this is not a network (ARP, IP, MTU, firewall) issue, I suggest removing this mon and deploying it again, or deploying it on another (unused) IP address.
Also, you can add --debug_ms=20, and you should then see some "lossy channel" messages before the quorum join fails.
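
Before redeploying, the network assumption can be checked from the host of the failing mon. The following is only a minimal sketch: the addresses and ports are copied from the monmap in the quoted log (3300 for msgr v2, 6789 for msgr v1), while the function names `can_connect` and `check_mons` are made up for illustration. A successful TCP connect only shows that routing and the firewall allow the connection; it says nothing about MTU problems with larger packets.

```python
import socket

# Addresses taken from the monmap (e16) in the quoted log.
MON_ADDRS = ["192.168.9.206", "192.168.9.209", "192.168.9.210"]
# 3300 = msgr v2, 6789 = msgr v1, as listed in the same monmap.
MON_PORTS = [3300, 6789]


def can_connect(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable networks.
        return False


def check_mons(addrs=MON_ADDRS, ports=MON_PORTS, timeout=2.0):
    """Map every (host, port) pair to its reachability as a boolean."""
    return {(h, p): can_connect(h, p, timeout) for h in addrs for p in ports}
```

Calling `check_mons()` on the node that runs the failing mon returns a dictionary such as `{("192.168.9.206", 3300): True, ...}`; any `False` entry points at the network rather than at the mon itself.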


k

> On 29 Mar 2022, at 15:20, Thomas Bruckmann <Thomas.Bruckmann@xxxxxxxxxxxxx> wrote:
> 
> Hello again,
> I have now increased the debug level to the maximum for the mons and I still have no idea what the problem could be.
> 
> So I am just posting the debug log of the mon that fails to join here, in the hope that someone can help me. In addition, the mon that is not joining seems to stay in the probing phase for quite a long time; sometimes it switches to synchronizing, which seems to work, and after that it is back to probing.
> 
> debug 2022-03-29T11:10:53.695+0000 7f81c0811700 10 mon.controller2@-1(probing) e16 bootstrap
> debug 2022-03-29T11:10:53.695+0000 7f81c0811700 10 mon.controller2@-1(probing) e16 sync_reset_requester
> debug 2022-03-29T11:10:53.695+0000 7f81c0811700 10 mon.controller2@-1(probing) e16 unregister_cluster_logger - not registered
> debug 2022-03-29T11:10:53.695+0000 7f81c0811700 10 mon.controller2@-1(probing) e16 cancel_probe_timeout (none scheduled)
> debug 2022-03-29T11:10:53.695+0000 7f81c0811700 10 mon.controller2@-1(probing) e16 monmap e16: 3 mons at {controller1=[v2:192.168.9.206:3300/0,v1:192.168.9.206:6789/0],controller4=[v2:192.168.9.209:3300/0,v1:192.168.9.209:6789/0],controller5=[v2:192.168.9.210:3300/0,v1:192.168.9.210:6789/0]}
> debug 2022-03-29T11:10:53.695+0000 7f81c0811700 10 mon.controller2@-1(probing) e16 _reset
> debug 2022-03-29T11:10:53.695+0000 7f81c0811700 10 mon.controller2@-1(probing).auth v46972 _set_mon_num_rank num 0 rank 0
> debug 2022-03-29T11:10:53.695+0000 7f81c0811700 10 mon.controller2@-1(probing) e16 cancel_probe_timeout (none scheduled)
> debug 2022-03-29T11:10:53.695+0000 7f81c0811700 10 mon.controller2@-1(probing) e16 timecheck_finish
> debug 2022-03-29T11:10:53.695+0000 7f81c0811700 15 mon.controller2@-1(probing) e16 health_tick_stop
> debug 2022-03-29T11:10:53.695+0000 7f81c0811700 15 mon.controller2@-1(probing) e16 health_interval_stop
> debug 2022-03-29T11:10:53.695+0000 7f81c0811700 10 mon.controller2@-1(probing) e16 scrub_event_cancel
> debug 2022-03-29T11:10:53.695+0000 7f81c0811700 10 mon.controller2@-1(probing) e16 scrub_reset
> debug 2022-03-29T11:10:53.695+0000 7f81c0811700 10 mon.controller2@-1(probing) e16 cancel_probe_timeout (none scheduled)
> debug 2022-03-29T11:10:53.695+0000 7f81c0811700 10 mon.controller2@-1(probing) e16 reset_probe_timeout 0x55c46fbb8d80 after 2 seconds
> debug 2022-03-29T11:10:53.695+0000 7f81c0811700 10 mon.controller2@-1(probing) e16 probing other monitors
> debug 2022-03-29T11:10:53.695+0000 7f81be00c700 20 mon.controller2@-1(probing) e16 _ms_dispatch existing session 0x55c46f8d4900 for mon.2
> debug 2022-03-29T11:10:53.695+0000 7f81be00c700 20 mon.controller2@-1(probing) e16  entity_name  global_id 0 (none) caps allow *
> debug 2022-03-29T11:10:53.695+0000 7f81be00c700 20 is_capable service=mon command= read addr v2:192.168.9.210:3300/0 on cap allow *
> debug 2022-03-29T11:10:53.695+0000 7f81be00c700 20  allow so far , doing grant allow *
> debug 2022-03-29T11:10:53.695+0000 7f81be00c700 20  allow all
> debug 2022-03-29T11:10:53.695+0000 7f81be00c700 10 mon.controller2@-1(probing) e16 handle_probe mon_probe(reply 9d036488-fb4f-4e5b-85ec-4ccf75501b48 name controller5 quorum 0,1,2 leader 0 paxos( fc 133912517 lc 133913211 ) mon_release pacific) v8
> debug 2022-03-29T11:10:53.695+0000 7f81be00c700 10 mon.controller2@-1(probing) e16 handle_probe_reply mon.2 v2:192.168.9.210:3300/0 mon_probe(reply 9d036488-fb4f-4e5b-85ec-4ccf75501b48 name controller5 quorum 0,1,2 leader 0 paxos( fc 133912517 lc 133913211 ) mon_release pacific) v8
> debug 2022-03-29T11:10:53.695+0000 7f81be00c700 10 mon.controller2@-1(probing) e16  monmap is e16: 3 mons at {controller1=[v2:192.168.9.206:3300/0,v1:192.168.9.206:6789/0],controller4=[v2:192.168.9.209:3300/0,v1:192.168.9.209:6789/0],controller5=[v2:192.168.9.210:3300/0,v1:192.168.9.210:6789/0]}
> debug 2022-03-29T11:10:53.695+0000 7f81be00c700 10 mon.controller2@-1(probing) e16  peer name is controller5
> debug 2022-03-29T11:10:53.695+0000 7f81be00c700 10 mon.controller2@-1(probing) e16  existing quorum 0,1,2
> debug 2022-03-29T11:10:53.695+0000 7f81be00c700 10 mon.controller2@-1(probing) e16  peer paxos version 133913211 vs my version 133913204 (ok)
> debug 2022-03-29T11:10:53.695+0000 7f81be00c700 10 mon.controller2@-1(probing) e16  ready to join, but i'm not in the monmap/my addr is blank/location is wrong, trying to join
> debug 2022-03-29T11:10:53.695+0000 7f81be00c700 20 mon.controller2@-1(probing) e16 _ms_dispatch existing session 0x55c46f8d4b40 for mon.1
> debug 2022-03-29T11:10:53.695+0000 7f81be00c700 20 mon.controller2@-1(probing) e16  entity_name  global_id 0 (none) caps allow *
> debug 2022-03-29T11:10:53.695+0000 7f81be00c700 20 is_capable service=mon command= read addr v2:192.168.9.209:3300/0 on cap allow *
> debug 2022-03-29T11:10:53.695+0000 7f81be00c700 20  allow so far , doing grant allow *
> debug 2022-03-29T11:10:53.695+0000 7f81be00c700 20  allow all
> debug 2022-03-29T11:10:53.695+0000 7f81be00c700 10 mon.controller2@-1(probing) e16 handle_probe mon_probe(reply 9d036488-fb4f-4e5b-85ec-4ccf75501b48 name controller4 quorum 0,1,2 leader 0 paxos( fc 133912517 lc 133913211 ) mon_release pacific) v8
> debug 2022-03-29T11:10:53.695+0000 7f81be00c700 10 mon.controller2@-1(probing) e16 handle_probe_reply mon.1 v2:192.168.9.209:3300/0 mon_probe(reply 9d036488-fb4f-4e5b-85ec-4ccf75501b48 name controller4 quorum 0,1,2 leader 0 paxos( fc 133912517 lc 133913211 ) mon_release pacific) v8
> debug 2022-03-29T11:10:53.695+0000 7f81be00c700 10 mon.controller2@-1(probing) e16  monmap is e16: 3 mons at {controller1=[v2:192.168.9.206:3300/0,v1:192.168.9.206:6789/0],controller4=[v2:192.168.9.209:3300/0,v1:192.168.9.209:6789/0],controller5=[v2:192.168.9.210:3300/0,v1:192.168.9.210:6789/0]}
> debug 2022-03-29T11:10:53.695+0000 7f81be00c700 10 mon.controller2@-1(probing) e16  peer name is controller4
> debug 2022-03-29T11:10:53.695+0000 7f81be00c700 10 mon.controller2@-1(probing) e16  existing quorum 0,1,2
> debug 2022-03-29T11:10:53.695+0000 7f81be00c700 10 mon.controller2@-1(probing) e16  peer paxos version 133913211 vs my version 133913204 (ok)
> debug 2022-03-29T11:10:53.695+0000 7f81be00c700 10 mon.controller2@-1(probing) e16  ready to join, but i'm not in the monmap/my addr is blank/location is wrong, trying to join
> debug 2022-03-29T11:10:54.453+0000 7f81c2014700 10 mon.controller2@-1(probing) e16 get_authorizer for mgr
> debug 2022-03-29T11:10:55.453+0000 7f81c2014700 10 mon.controller2@-1(probing) e16 get_authorizer for mgr
> debug 2022-03-29T11:10:55.695+0000 7f81c0811700  4 mon.controller2@-1(probing) e16 probe_timeout 0x55c46fbb8d80
> debug 2022-03-29T11:10:55.695+0000 7f81c0811700 10 mon.controller2@-1(probing) e16 bootstrap
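
The log above repeats two facts: monmap e16 lists only controller1, controller4 and controller5, and the peers' paxos version is slightly ahead of the local one. A small sketch for pulling both out of such a log is shown here. The regular expressions are assumptions derived only from the line formats visible in this thread, not from any Ceph specification, and the function names are hypothetical.

```python
import re

# Patterns modelled on the log lines quoted in this thread (an assumption).
MONMAP_RE = re.compile(r"monmap(?: is)? e(\d+): (\d+) mons at \{(.*)\}")
ENTRY_RE = re.compile(r"(\w[\w.-]*)=\[([^\]]*)\]")
PAXOS_RE = re.compile(r"peer paxos version (\d+) vs my version (\d+)")


def parse_monmap(line):
    """Return (epoch, mon_count, {name: [addrs]}) or None if no monmap is found."""
    m = MONMAP_RE.search(line)
    if m is None:
        return None
    epoch, count, body = int(m.group(1)), int(m.group(2)), m.group(3)
    mons = {name: addrs.split(",") for name, addrs in ENTRY_RE.findall(body)}
    return epoch, count, mons


def paxos_lag(line):
    """Return peer version minus local version, or None if the line does not match."""
    m = PAXOS_RE.search(line)
    if m is None:
        return None
    return int(m.group(1)) - int(m.group(2))
```

Run against the monmap line in the log, `parse_monmap` yields epoch 16 with three mons, none of them named controller2, which matches the "ready to join, but i'm not in the monmap" message. `paxos_lag` on the "peer paxos version" line gives 7, so the local store is only a few commits behind.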
> 
> Kind Regards,
> Thomas Bruckmann
> Systemadministrator Cloud Dienste
> E: Thomas.Bruckmann@xxxxxxxxxxxxx
> softgarden e-recruiting GmbH
> Tauentzienstraße 14 | 10789 Berlin
> https://softgarden.de/
> Gesellschaft mit beschränkter Haftung, Amtsgericht Berlin-Charlottenburg
> HRB 114159 B | USt-ID: DE260440441 | Geschäftsführer: Mathias Heese, Stefan Schüffler, Claus Müller
> 
> 
> From: Thomas Bruckmann <Thomas.Bruckmann@xxxxxxxxxxxxx>
> Date: Thursday, 24 March 2022 at 17:06
> To: ceph-users@xxxxxxx <ceph-users@xxxxxxx>
> Subject: Ceph Mon not able to authenticate
> Hello,
> We are running Ceph 16.2.6 and are having trouble with our mons. Everything is managed via ceph orch and runs in containers. Since we switched the firewall in our DC (which also provides DNS), our ceph mon daemons are not able to authenticate when they are restarted.
> 
> The error message in the monitor log is:
> 
> debug 2022-03-24T14:25:12.716+0000 7fa0dc2df700 1 mon.2@-1(probing) e13 handle_auth_request failed to assign global_id
> 
> What we already tried to solve the problem:
> 
>  *   Removed the mon completely from the node (including all artifacts in the FS)
>  *   Double-checked whether the mon is still in the monmap after removing it (it is not)
>  *   Added other mons (on nodes that were previously not mons) to ensure a unique and synced monmap, then tried adding the failing mon -> no success
>  *   Shut down a running mon (not one of the brand-new ones) and tried bringing it up again -> same error
> 
> It does not seem to be an error in the monmap; however, manipulating the monmap manually is currently not possible, since the system is in production and we cannot shut down the whole FS.
> 
> Another blog post (I cannot find the link anymore) says the problem could be related to DNS resolution, i.e. that the DNS name behind the IP may have changed. For each of our initial mons we have 3 different DNS names, which are returned on a reverse lookup; since we switched the firewall, the order in which those names are returned may have changed. I don't know if this could be the problem.
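
The reverse-lookup theory can at least be inspected from the mon hosts. What follows is a rough sketch using only Python's standard `socket` module; the helper names are invented for illustration, and whether all PTR names for an address show up as aliases depends on the system resolver, so an empty alias list does not prove that only one name exists.

```python
import socket


def reverse_names(ip, resolver=socket.gethostbyaddr):
    """Return the names a reverse lookup yields for ip, primary name first."""
    try:
        primary, aliases, _addrs = resolver(ip)
    except OSError:
        # socket.herror and socket.gaierror are both subclasses of OSError.
        return []
    return [primary] + list(aliases)


def forward_confirms(name, ip, resolver=socket.gethostbyname_ex):
    """Return True if resolving name forward gives back ip among its addresses."""
    try:
        _name, _aliases, addrs = resolver(name)
    except OSError:
        return False
    return ip in addrs
```

Running `reverse_names("192.168.9.206")` on each node and comparing the first entry across nodes and across repeated calls would show whether the order of the returned names is stable. `forward_confirms` then checks that each returned name resolves back to the same address.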
> 
> Does anyone have an idea how to solve the problem?
> 
> Kind Regards,
> Thomas Bruckmann
> _______________________________________________
> ceph-users mailing list -- ceph-users@xxxxxxx
> To unsubscribe send an email to ceph-users-leave@xxxxxxx




