Re: Monitor stuck at "probing"

Do you happen to be running on Debian Buster? I'm running into a similar
problem, though in my case I'm bootstrapping a new cluster using a manual
method (well, automated by Ansible following the Manual Install guide). The
very first time I bootstrap, everything seems fine; then, if I purge all the
ceph-* packages, configs, and data from all 3 nodes and redeploy, I hit this
issue. Though in my case I don't see any persistent cephx failures. Similarly,
I've verified that both my keyrings and monmaps are completely identical and
contain the expected hosts, verified network connectivity the same way, and am
basically at the same point of not knowing where else to look, or even what to
look for.
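
For what it's worth, the comparison I did was roughly the following (node
names are placeholders; paths assume a default Debian install, with the mon
id equal to the short hostname):

# checksum the mon keyrings on every node
for h in node1 node2 node3; do ssh $h 'sudo md5sum /var/lib/ceph/mon/ceph-*/keyring'; done

# on each node, with the local mon stopped, dump the on-disk monmap and compare
sudo ceph-mon -i $(hostname -s) --extract-monmap /tmp/monmap
monmaptool --print /tmp/monmap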

Log snippet of the auth failure on my side:

2019-06-16 15:24:09.003504 7ff5cd3f8700 10 mon.pvchv1@0(probing) e1 ms_verify_authorizer 10.0.0.2:6789/0 mon protocol 2
2019-06-16 15:24:09.003531 7ff5cd3f8700 10 cephx: verify_authorizer decrypted service mon secret_id=18446744073709551615
2019-06-16 15:24:09.003698 7ff5cd3f8700 10 cephx: verify_authorizer global_id=0
2019-06-16 15:24:09.003760 7ff5cd3f8700 10 cephx: cephx_verify_authorizer adding server_challenge 3886107952318386586
2019-06-16 15:24:09.003793 7ff5cd3f8700  0 mon.pvchv1@0(probing) e1 ms_verify_authorizer bad authorizer from mon 10.0.0.2:6789/0
2019-06-16 15:24:09.004148 7ff5cd3f8700 10 mon.pvchv1@0(probing) e1 ms_verify_authorizer 10.0.0.2:6789/0 mon protocol 2
2019-06-16 15:24:09.004173 7ff5cd3f8700 10 cephx: verify_authorizer decrypted service mon secret_id=18446744073709551615
2019-06-16 15:24:09.004221 7ff5cd3f8700 10 cephx: verify_authorizer global_id=0
2019-06-16 15:24:09.004230 7ff5cd3f8700 10 cephx: cephx_verify_authorizer got server_challenge+1 3886107952318386587 expecting 3886107952318386587
2019-06-16 15:24:09.004242 7ff5cd3f8700 10 cephx: verify_authorizer ok nonce 569bd961dcb99aca reply_bl.length()=36

Thanks,
Joshua

On 2019-06-14 10:40 p.m., ☣Adam wrote:
I have a monitor which I just can't seem to get to join the quorum, even
after injecting a monmap from one of the other servers.[1]  I use NTP on
all servers and also manually verified the clocks are synchronized.
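
The clock check was just a quick skew comparison across the four monitors
named below, something like:

for h in ceph0 ceph2 xe tc; do printf '%s: ' $h; ssh $h date +%s.%N; done

(sub-second agreement on every host).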


My monitors are named: ceph0, ceph2, xe, and tc

I'm transitioning away from the ceph# naming scheme, so please forgive
the confusing [lack of a] naming convention.


The relevant output from: ceph -s
1/4 mons down, quorum ceph0,ceph2,xe
mon: 4 daemons, quorum ceph0,ceph2,xe, out of quorum: tc


tc is up, bound to the expected IP address, and the ceph-mon service can
be reached from xe, ceph0 and ceph2 using telnet.  The mon_host and
mon_initial_members from `ceph daemon mon.tc config show` look correct.
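
Concretely, the checks were along these lines (tc's address is 192.168.60.11,
per the monmap further down):

# from each of xe, ceph0 and ceph2
telnet 192.168.60.11 6789

# on tc itself, via the admin socket
sudo ceph daemon mon.tc config show | grep -E 'mon_host|mon_initial_members'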

mon_status on tc shows the state as "probing", and the list of
"extra_probe_peers" looks correct (correct IP addresses and ports).
However, the monmap section looks wrong: "mons" lists all 4 servers, but
the addr and public_addr values are all 0.0.0.0:0.  Furthermore, it says
the monmap epoch is 4, which I don't understand, because I just injected
a monmap which has an epoch of 7.
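
I'm reading all of that from the admin socket, i.e. something like the
following (the jq filter is just for brevity and assumes jq is installed):

sudo ceph daemon mon.tc mon_status | jq '{state, epoch: .monmap.epoch, mons: .monmap.mons}'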

Here's the output of: monmaptool --print ./monmap
monmaptool: monmap file ./monmap
epoch 7
fsid a690e404-3152-4804-a960-8b52abf3bd65
last_changed 2019-06-02 17:38:50.161035
created 2018-12-28 20:26:41.443339
0: 192.168.60.10:6789/0 mon.ceph0
1: 192.168.60.11:6789/0 mon.tc
2: 192.168.60.12:6789/0 mon.ceph2
3: 192.168.60.53:6789/0 mon.xe

When I injected it, I stopped ceph-mon, ran:
sudo ceph-mon -i tc --inject-monmap ./monmap

and started ceph-mon again.  I then rebooted to see if it would fix this
epoch/addr issue.  It did not.
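
For completeness, the whole sequence was roughly (systemd unit names assumed,
and the map was grabbed from a quorate mon first):

# on a mon that is in quorum, e.g. ceph2
ceph mon getmap -o ./monmap

# on tc
sudo systemctl stop ceph-mon@tc
sudo ceph-mon -i tc --inject-monmap ./monmap
sudo systemctl start ceph-mon@tc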

I'm attaching what I believe is the relevant section of my log file from
the tc monitor.  I ran `ceph auth list` on tc and ceph2 and verified
that the output is identical.  This check was based on what I saw in the
log and what I read in a blog post.[2]
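
The comparison itself was nothing fancier than something like this (assumes
ssh access and admin keyrings on both hosts):

diff <(ssh tc 'ceph auth list') <(ssh ceph2 'ceph auth list')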

What are the next steps in troubleshooting this issue?


Thanks,
Adam


[1]
http://docs.ceph.com/docs/jewel/rados/troubleshooting/troubleshooting-mon/
[2]
https://medium.com/@george.shuklin/silly-mistakes-with-ceph-mon-9ef6c9eaab54

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
