Re: Ceph Mon not able to authenticate

Hi Konstantin,
Thank you for your reply!
Yes, we have meanwhile double-checked that the network is working correctly. Those nodes are also used as K8s workers, and all the containers are running correctly; the OSD, MGR and MDS containers are running fine as well. The only problem is the mon containers.

And those mons are not coming up on the nodes we have used as mons for years, while it is no problem to deploy a mon to any other K8s worker node in the same subnet.

I also enabled debug log level 20, and I cannot find any reason or hint in those logs as to why these mons are not able to join.

Interestingly, it is not just one machine; it happens on all of the old mon machines. Whenever the mon container is redeployed or restarted, it does not come up again. And a hardware error on 3 machines at the same time is quite unlikely.

I also fully redeployed the mon containers and ensured after removing them that absolutely no artifacts were left on the machine, so that nothing remained in
/var/lib/ceph/<id>/mon.*
/var/run/ceph/<id>
/var/lib/ceph/<id>/crash

The only directories mounted into the container where I did not delete files are /dev, /udev and /var/log/ceph/<id>. Additionally, I removed the stopped Ceph containers before redeploying. I have no idea how I could remove a Ceph mon from a machine more thoroughly 😃
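For reference, the removal steps I performed can be sketched roughly like this (only an illustration of what I did; FSID and HOST are placeholders for the real cluster fsid and mon host name, and the `ceph orch daemon rm` call assumes the orchestrator-managed setup we use):

```shell
# Sketch of the mon cleanup described above; FSID and HOST are
# placeholders, not our real values.
cleanup_mon_leftovers() {
    FSID=$1    # e.g. the output of 'ceph fsid'
    HOST=$2    # e.g. controller2

    # remove the daemon via the orchestrator first
    ceph orch daemon rm "mon.${HOST}" --force

    # then, on the host itself, wipe the leftover state directories
    rm -rf "/var/lib/ceph/${FSID}/mon.${HOST}" \
           "/var/run/ceph/${FSID}" \
           "/var/lib/ceph/${FSID}/crash"
}
```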

I hope you or someone else still has an idea, or at least a direction. Even a hint about what exactly to check in the network stack of the servers would be welcome.

Kind Regards,
Thomas Bruckmann
Systemadministrator Cloud Dienste
E  Thomas.Bruckmann@xxxxxxxxxxxxx
softgarden e-recruiting GmbH
Tauentzienstraße 14 | 10789 Berlin
https://softgarden.com/de
Gesellschaft mit beschränkter Haftung, Amtsgericht Berlin-Charlottenburg
HRB 114159 B | USt-ID: DE260440441 | Geschäftsführer: Mathias Heese, Stefan Schüffler, Claus Müller


From: Konstantin Shalygin <k0ste@xxxxxxxx>
Date: Wednesday, 30 March 2022 at 10:05
To: Thomas Bruckmann <Thomas.Bruckmann@xxxxxxxxxxxxx>
Cc: ceph-users@xxxxxxx <ceph-users@xxxxxxx>
Subject: Re: Ceph Mon not able to authenticate
Hi,

You are not the first with this issue.
If you are 146% sure that it is not a network (arp, ip, mtu, firewall) issue, I suggest removing this mon and deploying it again, or deploying it on another (unused) IP address.
Also, you can add --debug_ms=20; you should then see some "lossy channel" messages before the quorum join fails.
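For the network items, the basic checks from the failing mon host toward a peer mon could look something like this (a sketch only; the MTU value and tools like `nc` are assumptions, adjust for your environment):

```shell
# Sketch: per-item network checks against a peer mon address.
# The MTU is an assumption; set it to your actual link MTU.
check_peer() {
    peer=$1
    mtu=${2:-1500}
    ip neigh show "$peer"                        # arp: stale or incomplete entry?
    ping -M do -s $(( mtu - 28 )) -c 3 "$peer"   # mtu: full-size frame without fragmentation
                                                 #      (payload = MTU - 20 IP - 8 ICMP)
    nc -vz "$peer" 3300                          # firewall: msgr2 port reachable?
    nc -vz "$peer" 6789                          # firewall: msgr1 port reachable?
}
```

For example, `check_peer 192.168.9.206 9000` from the failing host, for each peer mon in turn.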


k

> On 29 Mar 2022, at 15:20, Thomas Bruckmann <Thomas.Bruckmann@xxxxxxxxxxxxx> wrote:
>
> Hello again,
> I have now increased the debug level to the maximum for the mons, and I still have no idea what the problem could be.
>
> So I will just paste the debug log of the mon failing to join here, in the hope that someone can help me. In addition, the mon that is not joining stays quite long in the probing phase; sometimes it switches to synchronizing, which seems to work, and after that it is back to probing.
>
> debug 2022-03-29T11:10:53.695+0000 7f81c0811700 10 mon.controller2@-1(probing) e16 bootstrap
> debug 2022-03-29T11:10:53.695+0000 7f81c0811700 10 mon.controller2@-1(probing) e16 sync_reset_requester
> debug 2022-03-29T11:10:53.695+0000 7f81c0811700 10 mon.controller2@-1(probing) e16 unregister_cluster_logger - not registered
> debug 2022-03-29T11:10:53.695+0000 7f81c0811700 10 mon.controller2@-1(probing) e16 cancel_probe_timeout (none scheduled)
> debug 2022-03-29T11:10:53.695+0000 7f81c0811700 10 mon.controller2@-1(probing) e16 monmap e16: 3 mons at {controller1=[v2:192.168.9.206:3300/0,v1:192.168.9.206:6789/0],controller4=[v2:192.168.9.209:3300/0,v1:192.168.9.209:6789/0],controller5=[v2:192.168.9.210:3300/0,v1:192.168.9.210:6789/0]}
> debug 2022-03-29T11:10:53.695+0000 7f81c0811700 10 mon.controller2@-1(probing) e16 _reset
> debug 2022-03-29T11:10:53.695+0000 7f81c0811700 10 mon.controller2@-1(probing).auth v46972 _set_mon_num_rank num 0 rank 0
> debug 2022-03-29T11:10:53.695+0000 7f81c0811700 10 mon.controller2@-1(probing) e16 cancel_probe_timeout (none scheduled)
> debug 2022-03-29T11:10:53.695+0000 7f81c0811700 10 mon.controller2@-1(probing) e16 timecheck_finish
> debug 2022-03-29T11:10:53.695+0000 7f81c0811700 15 mon.controller2@-1(probing) e16 health_tick_stop
> debug 2022-03-29T11:10:53.695+0000 7f81c0811700 15 mon.controller2@-1(probing) e16 health_interval_stop
> debug 2022-03-29T11:10:53.695+0000 7f81c0811700 10 mon.controller2@-1(probing) e16 scrub_event_cancel
> debug 2022-03-29T11:10:53.695+0000 7f81c0811700 10 mon.controller2@-1(probing) e16 scrub_reset
> debug 2022-03-29T11:10:53.695+0000 7f81c0811700 10 mon.controller2@-1(probing) e16 cancel_probe_timeout (none scheduled)
> debug 2022-03-29T11:10:53.695+0000 7f81c0811700 10 mon.controller2@-1(probing) e16 reset_probe_timeout 0x55c46fbb8d80 after 2 seconds
> debug 2022-03-29T11:10:53.695+0000 7f81c0811700 10 mon.controller2@-1(probing) e16 probing other monitors
> debug 2022-03-29T11:10:53.695+0000 7f81be00c700 20 mon.controller2@-1(probing) e16 _ms_dispatch existing session 0x55c46f8d4900 for mon.2
> debug 2022-03-29T11:10:53.695+0000 7f81be00c700 20 mon.controller2@-1(probing) e16  entity_name  global_id 0 (none) caps allow *
> debug 2022-03-29T11:10:53.695+0000 7f81be00c700 20 is_capable service=mon command= read addr v2:192.168.9.210:3300/0 on cap allow *
> debug 2022-03-29T11:10:53.695+0000 7f81be00c700 20  allow so far , doing grant allow *
> debug 2022-03-29T11:10:53.695+0000 7f81be00c700 20  allow all
> debug 2022-03-29T11:10:53.695+0000 7f81be00c700 10 mon.controller2@-1(probing) e16 handle_probe mon_probe(reply 9d036488-fb4f-4e5b-85ec-4ccf75501b48 name controller5 quorum 0,1,2 leader 0 paxos( fc 133912517 lc 133913211 ) mon_release pacific) v8
> debug 2022-03-29T11:10:53.695+0000 7f81be00c700 10 mon.controller2@-1(probing) e16 handle_probe_reply mon.2 v2:192.168.9.210:3300/0 mon_probe(reply 9d036488-fb4f-4e5b-85ec-4ccf75501b48 name controller5 quorum 0,1,2 leader 0 paxos( fc 133912517 lc 133913211 ) mon_release pacific) v8
> debug 2022-03-29T11:10:53.695+0000 7f81be00c700 10 mon.controller2@-1(probing) e16  monmap is e16: 3 mons at {controller1=[v2:192.168.9.206:3300/0,v1:192.168.9.206:6789/0],controller4=[v2:192.168.9.209:3300/0,v1:192.168.9.209:6789/0],controller5=[v2:192.168.9.210:3300/0,v1:192.168.9.210:6789/0]}
> debug 2022-03-29T11:10:53.695+0000 7f81be00c700 10 mon.controller2@-1(probing) e16  peer name is controller5
> debug 2022-03-29T11:10:53.695+0000 7f81be00c700 10 mon.controller2@-1(probing) e16  existing quorum 0,1,2
> debug 2022-03-29T11:10:53.695+0000 7f81be00c700 10 mon.controller2@-1(probing) e16  peer paxos version 133913211 vs my version 133913204 (ok)
> debug 2022-03-29T11:10:53.695+0000 7f81be00c700 10 mon.controller2@-1(probing) e16  ready to join, but i'm not in the monmap/my addr is blank/location is wrong, trying to join
> debug 2022-03-29T11:10:53.695+0000 7f81be00c700 20 mon.controller2@-1(probing) e16 _ms_dispatch existing session 0x55c46f8d4b40 for mon.1
> debug 2022-03-29T11:10:53.695+0000 7f81be00c700 20 mon.controller2@-1(probing) e16  entity_name  global_id 0 (none) caps allow *
> debug 2022-03-29T11:10:53.695+0000 7f81be00c700 20 is_capable service=mon command= read addr v2:192.168.9.209:3300/0 on cap allow *
> debug 2022-03-29T11:10:53.695+0000 7f81be00c700 20  allow so far , doing grant allow *
> debug 2022-03-29T11:10:53.695+0000 7f81be00c700 20  allow all
> debug 2022-03-29T11:10:53.695+0000 7f81be00c700 10 mon.controller2@-1(probing) e16 handle_probe mon_probe(reply 9d036488-fb4f-4e5b-85ec-4ccf75501b48 name controller4 quorum 0,1,2 leader 0 paxos( fc 133912517 lc 133913211 ) mon_release pacific) v8
> debug 2022-03-29T11:10:53.695+0000 7f81be00c700 10 mon.controller2@-1(probing) e16 handle_probe_reply mon.1 v2:192.168.9.209:3300/0 mon_probe(reply 9d036488-fb4f-4e5b-85ec-4ccf75501b48 name controller4 quorum 0,1,2 leader 0 paxos( fc 133912517 lc 133913211 ) mon_release pacific) v8
> debug 2022-03-29T11:10:53.695+0000 7f81be00c700 10 mon.controller2@-1(probing) e16  monmap is e16: 3 mons at {controller1=[v2:192.168.9.206:3300/0,v1:192.168.9.206:6789/0],controller4=[v2:192.168.9.209:3300/0,v1:192.168.9.209:6789/0],controller5=[v2:192.168.9.210:3300/0,v1:192.168.9.210:6789/0]}
> debug 2022-03-29T11:10:53.695+0000 7f81be00c700 10 mon.controller2@-1(probing) e16  peer name is controller4
> debug 2022-03-29T11:10:53.695+0000 7f81be00c700 10 mon.controller2@-1(probing) e16  existing quorum 0,1,2
> debug 2022-03-29T11:10:53.695+0000 7f81be00c700 10 mon.controller2@-1(probing) e16  peer paxos version 133913211 vs my version 133913204 (ok)
> debug 2022-03-29T11:10:53.695+0000 7f81be00c700 10 mon.controller2@-1(probing) e16  ready to join, but i'm not in the monmap/my addr is blank/location is wrong, trying to join
> debug 2022-03-29T11:10:54.453+0000 7f81c2014700 10 mon.controller2@-1(probing) e16 get_authorizer for mgr
> debug 2022-03-29T11:10:55.453+0000 7f81c2014700 10 mon.controller2@-1(probing) e16 get_authorizer for mgr
> debug 2022-03-29T11:10:55.695+0000 7f81c0811700  4 mon.controller2@-1(probing) e16 probe_timeout 0x55c46fbb8d80
> debug 2022-03-29T11:10:55.695+0000 7f81c0811700 10 mon.controller2@-1(probing) e16 bootstrap
>
> Kind Regards,
> Thomas Bruckmann
> Systemadministrator Cloud Dienste
> E  Thomas.Bruckmann@xxxxxxxxxxxxx
> softgarden e-recruiting GmbH
> Tauentzienstraße 14 | 10789 Berlin
> https://softgarden.de/
> Gesellschaft mit beschränkter Haftung, Amtsgericht Berlin-Charlottenburg
> HRB 114159 B | USt-ID: DE260440441 | Geschäftsführer: Mathias Heese, Stefan Schüffler, Claus Müller
>
>
> From: Thomas Bruckmann <Thomas.Bruckmann@xxxxxxxxxxxxx>
> Date: Thursday, 24 March 2022 at 17:06
> To: ceph-users@xxxxxxx <ceph-users@xxxxxxx>
> Subject: Ceph Mon not able to authenticate
> Hello,
> We are running Ceph 16.2.6 and are having trouble with our mons. Everything is managed via ceph orch and running in containers. Since we switched our firewall in the DC (which also does DNS), our Ceph mon daemons are not able to authenticate when they are restarted.
>
> The error message in the monitor log is:
>
> debug 2022-03-24T14:25:12.716+0000 7fa0dc2df700 1 mon.2@-1(probing) e13 handle_auth_request failed to assign global_id
>
> What we already tried to solve the problem:
>
>  *   Removed the mon fully from the node (including all artifacts in the FS)
>  *   Double-checked whether the mon was still in the monmap after removing it (it was not)
>  *   Added other mons (on nodes that were previously not mons) to ensure a unique and synced monmap, then tried adding the failing mon -> no success
>  *   Shut down a running mon (not one of the brand-new ones) and tried bringing it up again -> same error
>
> It does not seem to be an error with the monmap; however, manipulating the monmap manually is currently not possible, since the system is in production and we cannot shut down the whole FS.
>
> Another blog post, whose link I can no longer find, says the problem could somehow be related to DNS resolution, i.e. that the DNS name behind the IP may have changed. For each of our initial mons, three different DNS names are returned on a reverse lookup; since we switched the firewall, the order in which those names are returned may have changed. I don't know whether this could be the problem.
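To compare what the reverse lookup currently returns for each mon IP (and whether the name order changed), something like the following sketch could be used; the IPs below stand in for the cluster's mon addresses, and `dig` may need to be installed:

```shell
# Sketch: print the reverse-lookup result for each mon IP, so the name
# order can be compared before and after the firewall change.
show_reverse_names() {
    for ip in "$@"; do
        echo "== $ip =="
        getent hosts "$ip"      # names via the system resolver (nsswitch order)
        dig +short -x "$ip"     # PTR records straight from DNS, if dig is available
    done
}
```

For example: `show_reverse_names 192.168.9.206 192.168.9.209 192.168.9.210`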
>
> Does anyone have an idea how to solve the problem?
>
> Kind Regards,
> Thomas Bruckmann
> _______________________________________________
> ceph-users mailing list -- ceph-users@xxxxxxx
> To unsubscribe send an email to ceph-users-leave@xxxxxxx



