Monitor troubles

I initially set up my ceph cluster on CentOS 7 with just one monitor. The
monitor runs on an OSD server (not ideal, will change soon). I've
tested it quite a lot over the last couple of months and things have
gone well. I knew I needed to add a couple more monitors, so I did the
following:

ceph-deploy mon create ceph02

And then the cluster hung. I did some googling and found suggestions
that I needed to define a public network, etc. I did so and restarted
the mons. No luck. I also added the new monitors to mon_initial_members
and mon_host. My current ceph.conf looks like this:

[global]
osd pool default size = 2
fsid = e2e43abc-e634-4a04-ae24-0c486a035b6e
mon_initial_members = ceph01,ceph02
mon_host = 10.0.5.2,10.0.5.3
auth_cluster_required = cephx
auth_service_required = cephx
auth_client_required = cephx
# All mons/osds are on 10.0.5.0 but deploy-server is on 10.0.10.0. I
# expect this second subnet is unnecessary to list here but thought it
# couldn't hurt. None of the mons/osds have a 10.0.10.0 interface so
# there can't be confusion, right?
public_network = 10.0.5.0/24,10.0.10.0/24 

[client]
rbd default features = 1
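
For what it's worth, I believe the effective value on a running daemon
can be checked through the admin socket with something like this (I
haven't verified this exact pipeline, the grep target is an assumption):

ceph daemon mon.ceph01 config show | grep public_network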


I then discovered and started following:
http://docs.ceph.com/docs/jewel/rados/troubleshooting/troubleshooting-mon/

Are the monitors running? Yes

Are you able to connect to the monitor’s servers? Yes

Does ceph -s run and obtain a reply from the cluster? No

What if ceph -s doesn’t finish? The docs say to try "ceph ping mon.ID":

[ceph-deploy@ceph-deploy my-cluster]$ ceph ping mon.ceph01
Error connecting to cluster: ObjectNotFound
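
I believe a client can also be pointed at a specific monitor directly
with the -m flag, bypassing mon_host resolution, so that would be
another way to test reachability, e.g.:

ceph -m 10.0.5.2:6789 -s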

The troubleshooting page then suggests trying the monitor admin socket. This works:

[root@ceph01 ~]# ceph daemon mon.ceph01 mon_status
{
    "name": "ceph01",
    "rank": 0,
    "state": "probing",
    "election_epoch": 0,
    "quorum": [],
    "outside_quorum": [
        "ceph01"
    ],
    "extra_probe_peers": [],
    "sync_provider": [],
    "monmap": {
        "epoch": 2,
        "fsid": "3e84db5d-3dc8-4104-89e7-da23c103ef50",
        "modified": "2016-11-01 19:55:28.083057",
        "created": "2016-09-05 01:22:09.228315",
        "mons": [
            {
                "rank": 0,
                "name": "ceph01",
                "addr": "10.0.5.2:6789\/0"
            },
            {
                "rank": 1,
                "name": "ceph02",
                "addr": "10.0.5.3:6789\/0"
            }
        ]
    }
}


[root@ceph02 ~]# ceph daemon mon.ceph02 mon_status
{
    "name": "ceph02",
    "rank": 0,
    "state": "probing",
    "election_epoch": 0,
    "quorum": [],
    "outside_quorum": [
        "ceph02"
    ],
    "extra_probe_peers": [
        "10.0.5.2:6789\/0"
    ],
    "sync_provider": [],
    "monmap": {
        "epoch": 0,
        "fsid": "e2e43abc-e634-4a04-ae24-0c486a035b6e",
        "modified": "2016-11-01 19:33:06.242314",
        "created": "2016-11-01 19:33:06.242314",
        "mons": [
            {
                "rank": 0,
                "name": "ceph02",
                "addr": "10.0.5.3:6789\/0"
            },
            {
                "rank": 1,
                "name": "ceph01",
                "addr": "0.0.0.0:0\/1"
            }
        ]
    }
}

So they are both in the probing state, they each list themselves in
outside_quorum, and ceph02 shows addr 0.0.0.0 for ceph01. I tried
telling ceph02 the address of ceph01 using "ceph daemon mon.ceph02
add_bootstrap_peer_hint 10.0.5.2", which is why it appears in
extra_probe_peers. It does not seem to have helped. I also notice the
fsids differ between the two mon_status outputs. No idea why. The
proper cluster fsid is e2e43abc-e634-4a04-ae24-0c486a035b6e. Could this
be what is messing things up? It would also explain why the cluster
hung: once ceph02 was added to ceph01's monmap, quorum requires a
majority of the two mons, i.e. both of them, and ceph01 alone can no
longer form it. ceph01 is the original monitor. What's weird, though,
is that the other fsid (3e84db5d-3dc8-4104-89e7-da23c103ef50) already
appears in the deployment log from when I first set up the cluster
weeks ago:

[2016-10-05 14:48:51,811][ceph01][INFO  ] Running command: sudo systemctl enable ceph.target
[2016-10-05 14:48:51,946][ceph01][INFO  ] Running command: sudo systemctl enable ceph-mon@ceph01
[2016-10-05 14:48:52,073][ceph01][INFO  ] Running command: sudo systemctl start ceph-mon@ceph01
[2016-10-05 14:48:54,104][ceph01][INFO  ] Running command: sudo ceph --cluster=ceph --admin-daemon /var/run/ceph/ceph-mon.ceph01.asok mon_status
[2016-10-05 14:48:54,272][ceph01][DEBUG ] ********************************************************************************
[2016-10-05 14:48:54,273][ceph01][DEBUG ] status for monitor: mon.ceph01
[2016-10-05 14:48:54,274][ceph01][DEBUG ] {
[2016-10-05 14:48:54,275][ceph01][DEBUG ]   "election_epoch": 5, 
[2016-10-05 14:48:54,275][ceph01][DEBUG ]   "extra_probe_peers": [], 
[2016-10-05 14:48:54,275][ceph01][DEBUG ]   "monmap": {
[2016-10-05 14:48:54,276][ceph01][DEBUG ]     "created": "2016-09-05 01:22:09.228315", 
[2016-10-05 14:48:54,276][ceph01][DEBUG ]     "epoch": 1, 
[2016-10-05 14:48:54,276][ceph01][DEBUG ]     "fsid": "3e84db5d-3dc8-4104-89e7-da23c103ef50", 
[2016-10-05 14:48:54,276][ceph01][DEBUG ]     "modified": "2016-09-05 01:22:09.228315", 
[2016-10-05 14:48:54,277][ceph01][DEBUG ]     "mons": [
[2016-10-05 14:48:54,277][ceph01][DEBUG ]       {
[2016-10-05 14:48:54,277][ceph01][DEBUG ]         "addr": "10.0.5.2:6789/0", 
[2016-10-05 14:48:54,277][ceph01][DEBUG ]         "name": "ceph01", 
[2016-10-05 14:48:54,278][ceph01][DEBUG ]         "rank": 0
[2016-10-05 14:48:54,278][ceph01][DEBUG ]       }
[2016-10-05 14:48:54,279][ceph01][DEBUG ]     ]
[2016-10-05 14:48:54,279][ceph01][DEBUG ]   }, 
[2016-10-05 14:48:54,280][ceph01][DEBUG ]   "name": "ceph01", 
[2016-10-05 14:48:54,280][ceph01][DEBUG ]   "outside_quorum": [], 
[2016-10-05 14:48:54,281][ceph01][DEBUG ]   "quorum": [
[2016-10-05 14:48:54,282][ceph01][DEBUG ]     0
[2016-10-05 14:48:54,282][ceph01][DEBUG ]   ], 
[2016-10-05 14:48:54,282][ceph01][DEBUG ]   "rank": 0, 
[2016-10-05 14:48:54,282][ceph01][DEBUG ]   "state": "leader", 
[2016-10-05 14:48:54,282][ceph01][DEBUG ]   "sync_provider": []
[2016-10-05 14:48:54,283][ceph01][DEBUG ] }
[2016-10-05 14:48:54,283][ceph01][DEBUG ] ********************************************************************************
[2016-10-05 14:48:54,283][ceph01][INFO  ] monitor: mon.ceph01 is running

But the cluster worked just fine until I tried adding two more monitors.

Following the troubleshooting section "Recovering a Monitor’s Broken
monmap", I thought I would try extracting a monmap, with the idea that
I might learn something or possibly be able to change the fsid on
ceph01.

[root@ceph01 ~]# ceph-mon -i mon.ceph01 --extract-monmap /tmp/monmap
monitor data directory at '/var/lib/ceph/mon/ceph-mon.ceph01' does not
exist: have you run 'mkfs'?

So that didn't get me anything either.
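
Writing this up, I suspect that invocation was simply wrong: as far as
I can tell, ceph-mon's -i flag takes the bare monitor ID without the
"mon." prefix, which is why it went looking for
/var/lib/ceph/mon/ceph-mon.ceph01 instead of the directory that
actually exists, /var/lib/ceph/mon/ceph-ceph01. Presumably the right
way to do it is something like this (untested; I believe the mon has to
be stopped while its store is read):

systemctl stop ceph-mon@ceph01
ceph-mon -i ceph01 --extract-monmap /tmp/monmap
monmaptool --print /tmp/monmap   # should show the fsid and mon addrs stored on disk
systemctl start ceph-mon@ceph01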

The mon log on ceph01 contains repetitions of:
2016-11-01 21:34:33.588396 7ff029c70700  0 mon.ceph01@0(probing) e2 handle_probe ignoring fsid e2e43abc-e634-4a04-ae24-0c486a035b6e != 3e84db5d-3dc8-4104-89e7-da23c103ef50
2016-11-01 21:34:35.739479 7ff029c70700  0 mon.ceph01@0(probing) e2 handle_probe ignoring fsid e2e43abc-e634-4a04-ae24-0c486a035b6e != 3e84db5d-3dc8-4104-89e7-da23c103ef50
2016-11-01 21:34:35.936020 7ff024f3f700  0 -- 10.0.5.2:6789/0 >> 10.0.5.5:0/3093707402 pipe(0x7ff03d57e800 sd=20 :6789 s=0 pgs=0 cs=0 l=0 c=0x7ff03d81e580).accept peer addr is really 10.0.5.5:0/3093707402 (socket is 10.0.5.5:44360/0)
2016-11-01 21:34:37.890073 7ff029c70700  0 mon.ceph01@0(probing) e2 handle_probe ignoring fsid e2e43abc-e634-4a04-ae24-0c486a035b6e != 3e84db5d-3dc8-4104-89e7-da23c103ef50
2016-11-01 21:34:40.043113 7ff029c70700  0 mon.ceph01@0(probing) e2 handle_probe ignoring fsid e2e43abc-e634-4a04-ae24-0c486a035b6e != 3e84db5d-3dc8-4104-89e7-da23c103ef50
2016-11-01 21:34:40.554165 7ff02a471700  0 mon.ceph01@0(probing).data_health(0) update_stats avail 96% total 51175 MB, used 1850 MB, avail 49324 MB

while the mon log on ceph02 contains repetitions of:
2016-11-01 21:34:11.327458 7f33f4284700  0 log_channel(audit) log [DBG] : from='admin socket' entity='admin socket' cmd='mon_status' args=[]: dispatch
2016-11-01 21:34:11.327623 7f33f4284700  0 log_channel(audit) log [DBG] : from='admin socket' entity='admin socket' cmd=mon_status args=[]: finished
2016-11-01 21:34:12.451514 7f33f4284700  0 log_channel(audit) log [DBG] : from='admin socket' entity='admin socket' cmd='mon_status' args=[]: dispatch
2016-11-01 21:34:12.451683 7f33f4284700  0 log_channel(audit) log [DBG] : from='admin socket' entity='admin socket' cmd=mon_status args=[]: finished
2016-11-01 21:34:12.780988 7f33f1715700  0 mon.ceph02@0(probing) e0 handle_probe ignoring fsid 3e84db5d-3dc8-4104-89e7-da23c103ef50 != e2e43abc-e634-4a04-ae24-0c486a035b6e

Any ideas on how to recover from this situation would be greatly appreciated!

-- 
Tracy Reed

