Hello Joao,
Thanks for your help. I increased logging on the failed monitor and noticed a lot of cephx authentication errors. After verifying NTP sync, I noticed that the monitor keyring deployed on the working monitors differed from what was stored in the management server’s ceph.mon.keyring. Syncing the key and redeploying the monitors got them to peer and establish quorum.
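For anyone hitting the same symptom, here is a minimal sketch of the keyring comparison described above. It assumes the usual keyring layout (an `[entity]` section header followed by a `key = ...` line). The two key strings are made-up placeholders, not real keys, and `parse_keyring` is a hypothetical helper, not part of any Ceph tool.

```python
def parse_keyring(text):
    """Map each [entity] section of a keyring file to its 'key = ...' value."""
    keys = {}
    entity = None
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("[") and line.endswith("]"):
            entity = line[1:-1]
        elif entity and "=" in line:
            # Split on the first '=' only, since base64 keys may end in '='.
            name, _, value = line.partition("=")
            if name.strip() == "key":
                keys[entity] = value.strip()
    return keys


# Placeholder contents (assumption): one copy from the management server,
# one from a working monitor. Real keys are base64 strings.
mgmt_keyring = """[mon.]
    key = PLACEHOLDERKEYAAAA==
    caps mon = "allow *"
"""
deployed_keyring = """[mon.]
    key = PLACEHOLDERKEYBBBB==
    caps mon = "allow *"
"""

mgmt = parse_keyring(mgmt_keyring)
deployed = parse_keyring(deployed_keyring)

# Entities whose key differs between the two copies (or is missing in one).
mismatched = sorted(e for e in set(mgmt) | set(deployed)
                    if mgmt.get(e) != deployed.get(e))
print("entities with differing keys:", mismatched)
```

In practice the two strings would be read from the management server's ceph.mon.keyring and from the keyring on a monitor that is in quorum.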
Joao,
Please see below. I think you’re totally right on:
I suspect they may already have this monitor in their map, but either with a different name or a different address -- and are thus ignoring probes from a peer that does not match what they are expecting.
# ceph --cluster=ceph --admin-daemon /var/run/ceph/ceph-mon.smg01.asok mon_status
{
  "name": "smg01",
  "rank": 0,
  "state": "probing",
  "election_epoch": 0,
  "quorum": [],
  "outside_quorum": [ "smg01" ],
  "extra_probe_peers": [
    "10.20.1.8:6789\/0",
    "10.20.10.251:6789\/0",
    "10.20.10.252:6789\/0"
  ],
  "sync_provider": [],
  "monmap": {
    "epoch": 0,
    "fsid": "693834c1-1f95-4237-ab97-a767b0c0e6e7",
    "modified": "0.000000",
    "created": "0.000000",
    "mons": [
      { "rank": 0, "name": "smg01", "addr": "10.20.10.250:6789\/0" },
      { "rank": 1, "name": "smon01s", "addr": "0.0.0.0:0\/1" },
      { "rank": 2, "name": "smon02s", "addr": "0.0.0.0:0\/2" },
      { "rank": 3, "name": "b02s08", "addr": "0.0.0.0:0\/3" }
    ]
  }
}
# ceph --cluster=ceph --admin-daemon /var/run/ceph/ceph-mon.smon01.asok mon_status
{
  "name": "smon01",
  "rank": 1,
  "state": "peon",
  "election_epoch": 2702,
  "quorum": [ 0, 1, 2 ],
  "outside_quorum": [],
  "extra_probe_peers": [],
  "sync_provider": [],
  "monmap": {
    "epoch": 12,
    "fsid": "693834c1-1f95-4237-ab97-a767b0c0e6e7",
    "modified": "2015-12-09 06:23:43.665100",
    "created": "0.000000",
    "mons": [
      { "rank": 0, "name": "b02s08", "addr": "10.20.1.8:6789\/0" },
      { "rank": 1, "name": "smon01", "addr": "10.20.10.251:6789\/0" },
      { "rank": 2, "name": "smon02", "addr": "10.20.10.252:6789\/0" }
    ]
  }
}
# ceph --cluster=ceph --admin-daemon /var/run/ceph/ceph-mon.smon02.asok mon_status
{
  "name": "smon02",
  "rank": 2,
  "state": "peon",
  "election_epoch": 2702,
  "quorum": [ 0, 1, 2 ],
  "outside_quorum": [],
  "extra_probe_peers": [],
  "sync_provider": [],
  "monmap": {
    "epoch": 12,
    "fsid": "693834c1-1f95-4237-ab97-a767b0c0e6e7",
    "modified": "2015-12-09 06:23:43.665100",
    "created": "0.000000",
    "mons": [
      { "rank": 0, "name": "b02s08", "addr": "10.20.1.8:6789\/0" },
      { "rank": 1, "name": "smon01", "addr": "10.20.10.251:6789\/0" },
      { "rank": 2, "name": "smon02", "addr": "10.20.10.252:6789\/0" }
    ]
  }
}
# ceph --cluster=ceph --admin-daemon /var/run/ceph/ceph-mon.b02s08.asok mon_status
{
  "name": "b02s08",
  "rank": 0,
  "state": "leader",
  "election_epoch": 2702,
  "quorum": [ 0, 1, 2 ],
  "outside_quorum": [],
  "extra_probe_peers": [],
  "sync_provider": [],
  "monmap": {
    "epoch": 12,
    "fsid": "693834c1-1f95-4237-ab97-a767b0c0e6e7",
    "modified": "2015-12-09 06:23:43.665100",
    "created": "0.000000",
    "mons": [
      { "rank": 0, "name": "b02s08", "addr": "10.20.1.8:6789\/0" },
      { "rank": 1, "name": "smon01", "addr": "10.20.10.251:6789\/0" },
      { "rank": 2, "name": "smon02", "addr": "10.20.10.252:6789\/0" }
    ]
  }
}
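The mismatch is easier to see when the monmaps from the mon_status output above are compared side by side. The sketch below is only an illustration: the dictionaries are trimmed copies of the "monmap" sections from smg01 and from the quorum members, and `monmap_of` and `diff_monmaps` are hypothetical helper names rather than anything Ceph provides.

```python
def monmap_of(status):
    """Extract {monitor name: address} from a parsed mon_status document."""
    return {m["name"]: m["addr"] for m in status["monmap"]["mons"]}


def diff_monmaps(local, quorum):
    """Return names only in local, names only in quorum, and address mismatches."""
    only_local = sorted(set(local) - set(quorum))
    only_quorum = sorted(set(quorum) - set(local))
    addr_mismatch = {
        name: (local[name], quorum[name])
        for name in sorted(set(local) & set(quorum))
        if local[name] != quorum[name]
    }
    return only_local, only_quorum, addr_mismatch


# Trimmed from the smg01 output (state "probing", monmap epoch 0).
probing_status = {"monmap": {"mons": [
    {"rank": 0, "name": "smg01", "addr": "10.20.10.250:6789/0"},
    {"rank": 1, "name": "smon01s", "addr": "0.0.0.0:0/1"},
    {"rank": 2, "name": "smon02s", "addr": "0.0.0.0:0/2"},
    {"rank": 3, "name": "b02s08", "addr": "0.0.0.0:0/3"},
]}}

# Trimmed from the b02s08 / smon01 / smon02 output (monmap epoch 12).
quorum_status = {"monmap": {"mons": [
    {"rank": 0, "name": "b02s08", "addr": "10.20.1.8:6789/0"},
    {"rank": 1, "name": "smon01", "addr": "10.20.10.251:6789/0"},
    {"rank": 2, "name": "smon02", "addr": "10.20.10.252:6789/0"},
]}}

only_local, only_quorum, addr_mismatch = diff_monmaps(
    monmap_of(probing_status), monmap_of(quorum_status)
)
print("known only to smg01:     ", only_local)
print("known only to the quorum:", only_quorum)
print("address differs:         ", addr_mismatch)
```

With this data, smg01 expects peers named smon01s and smon02s, while the quorum knows them as smon01 and smon02, and smg01 itself is absent from the quorum's monmap.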
On Dec 14, 2015, at 04:56, Joao Eduardo Luis <joao@xxxxxxx> wrote:
On 12/14/2015 12:41 AM, deeepdish wrote:
Perhaps I’m not understanding something.
The “extra_probe_peers” ARE the other working monitors in quorum out of the mon_host line in ceph.conf.
In the example below, 10.20.1.8 = b02s08; 10.20.10.251 = smon01s; 10.20.10.252 = smon02s.
The monitor is not reaching out to the other IPs and syncing. I’m able to ping all IPs in the extra_probe_peers list.
Okay, so that means the other monitors are, for some reason, ignoring the probes from this monitor. Can you please show the result of mon_status from the monitors in the quorum?

I suspect they may already have this monitor in their map, but either with a different name or a different address -- and are thus ignoring probes from a peer that does not match what they are expecting.

-Joao