I initially set up my ceph cluster on CentOS 7 with just one monitor. The monitor runs on an OSD server (not ideal, will change soon). I've tested it quite a lot over the last couple of months and things have gone well. I knew I needed to add a couple more monitors, so I ran:

ceph-deploy mon create ceph02

And then the cluster hung. Some googling turned up advice saying I needed to add a public network, etc. I did so and restarted the mons. No luck. I also added the new monitor to mon_initial_members and mon_host. My current ceph.conf looks like this:

[global]
osd pool default size = 2
fsid = e2e43abc-e634-4a04-ae24-0c486a035b6e
mon_initial_members = ceph01,ceph02
mon_host = 10.0.5.2,10.0.5.3
auth_cluster_required = cephx
auth_service_required = cephx
auth_client_required = cephx
# All mons/osds are on 10.0.5.0 but deploy-server is on 10.0.10.0. I
# expect this second subnet is unnecessary to list here but thought it
# couldn't hurt. None of the mons/osds have a 10.0.10.0 interface so
# there can't be confusion, right?
public_network = 10.0.5.0/24,10.0.10.0/24

[client]
rbd default features = 1

I then discovered and started following:

http://docs.ceph.com/docs/jewel/rados/troubleshooting/troubleshooting-mon/

Are the monitors running? Yes.
Are you able to connect to the monitor’s servers? Yes.
Does ceph -s run and obtain a reply from the cluster? No.
What if ceph -s doesn’t finish? It says to try "ceph ping mon.ID":

[ceph-deploy@ceph-deploy my-cluster]$ ceph ping mon.ceph01
Error connecting to cluster: ObjectNotFound

Then it suggests trying the monitor admin socket. This works:

[root@ceph01 ~]# ceph daemon mon.ceph01 mon_status
{
    "name": "ceph01",
    "rank": 0,
    "state": "probing",
    "election_epoch": 0,
    "quorum": [],
    "outside_quorum": [
        "ceph01"
    ],
    "extra_probe_peers": [],
    "sync_provider": [],
    "monmap": {
        "epoch": 2,
        "fsid": "3e84db5d-3dc8-4104-89e7-da23c103ef50",
        "modified": "2016-11-01 19:55:28.083057",
        "created": "2016-09-05 01:22:09.228315",
        "mons": [
            {
                "rank": 0,
                "name": "ceph01",
                "addr": "10.0.5.2:6789\/0"
            },
            {
                "rank": 1,
                "name": "ceph02",
                "addr": "10.0.5.3:6789\/0"
            }
        ]
    }
}

[root@ceph02 ~]# ceph daemon mon.ceph02 mon_status
{
    "name": "ceph02",
    "rank": 0,
    "state": "probing",
    "election_epoch": 0,
    "quorum": [],
    "outside_quorum": [
        "ceph02"
    ],
    "extra_probe_peers": [
        "10.0.5.2:6789\/0"
    ],
    "sync_provider": [],
    "monmap": {
        "epoch": 0,
        "fsid": "e2e43abc-e634-4a04-ae24-0c486a035b6e",
        "modified": "2016-11-01 19:33:06.242314",
        "created": "2016-11-01 19:33:06.242314",
        "mons": [
            {
                "rank": 0,
                "name": "ceph02",
                "addr": "10.0.5.3:6789\/0"
            },
            {
                "rank": 1,
                "name": "ceph01",
                "addr": "0.0.0.0:0\/1"
            }
        ]
    }
}

So both monitors are in the probing state, each reports itself in outside_quorum, and ceph02 shows addr 0.0.0.0 for ceph01. I tried telling ceph02 the address of ceph01 with "ceph daemon mon.ceph02 add_bootstrap_peer_hint 10.0.5.2", which is why it appears in extra_probe_peers, but that does not seem to have helped.

I also notice the fsids are different in the mon_status output. No idea why. The proper cluster fsid is e2e43abc-e634-4a04-ae24-0c486a035b6e. Could this be what is messing things up? ceph01 is the original monitor.
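In case it matters: I haven't yet checked which fsid the OSDs themselves were created with. My assumption (and it is only an assumption) is that whatever the OSDs carry in their ceph_fsid files is the "real" cluster fsid and is what ceph.conf should contain. Assuming the default ceph-deploy data directory layout, something like this on each OSD host should show it:

# cluster fsid each OSD was created with (default /var/lib/ceph layout assumed)
cat /var/lib/ceph/osd/ceph-*/ceph_fsid
# versus what ceph.conf claims
grep fsid /etc/ceph/ceph.conf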
What's weird, though, is that this same fsid appears in the deployment log from weeks ago when I first set up the cluster:

[2016-10-05 14:48:51,811][ceph01][INFO ] Running command: sudo systemctl enable ceph.target
[2016-10-05 14:48:51,946][ceph01][INFO ] Running command: sudo systemctl enable ceph-mon@ceph01
[2016-10-05 14:48:52,073][ceph01][INFO ] Running command: sudo systemctl start ceph-mon@ceph01
[2016-10-05 14:48:54,104][ceph01][INFO ] Running command: sudo ceph --cluster=ceph --admin-daemon /var/run/ceph/ceph-mon.ceph01.asok mon_status
[2016-10-05 14:48:54,272][ceph01][DEBUG ] ********************************************************************************
[2016-10-05 14:48:54,273][ceph01][DEBUG ] status for monitor: mon.ceph01
[2016-10-05 14:48:54,274][ceph01][DEBUG ] {
[2016-10-05 14:48:54,275][ceph01][DEBUG ]   "election_epoch": 5,
[2016-10-05 14:48:54,275][ceph01][DEBUG ]   "extra_probe_peers": [],
[2016-10-05 14:48:54,275][ceph01][DEBUG ]   "monmap": {
[2016-10-05 14:48:54,276][ceph01][DEBUG ]     "created": "2016-09-05 01:22:09.228315",
[2016-10-05 14:48:54,276][ceph01][DEBUG ]     "epoch": 1,
[2016-10-05 14:48:54,276][ceph01][DEBUG ]     "fsid": "3e84db5d-3dc8-4104-89e7-da23c103ef50",
[2016-10-05 14:48:54,276][ceph01][DEBUG ]     "modified": "2016-09-05 01:22:09.228315",
[2016-10-05 14:48:54,277][ceph01][DEBUG ]     "mons": [
[2016-10-05 14:48:54,277][ceph01][DEBUG ]       {
[2016-10-05 14:48:54,277][ceph01][DEBUG ]         "addr": "10.0.5.2:6789/0",
[2016-10-05 14:48:54,277][ceph01][DEBUG ]         "name": "ceph01",
[2016-10-05 14:48:54,278][ceph01][DEBUG ]         "rank": 0
[2016-10-05 14:48:54,278][ceph01][DEBUG ]       }
[2016-10-05 14:48:54,279][ceph01][DEBUG ]     ]
[2016-10-05 14:48:54,279][ceph01][DEBUG ]   },
[2016-10-05 14:48:54,280][ceph01][DEBUG ]   "name": "ceph01",
[2016-10-05 14:48:54,280][ceph01][DEBUG ]   "outside_quorum": [],
[2016-10-05 14:48:54,281][ceph01][DEBUG ]   "quorum": [
[2016-10-05 14:48:54,282][ceph01][DEBUG ]     0
[2016-10-05 14:48:54,282][ceph01][DEBUG ]   ],
[2016-10-05 14:48:54,282][ceph01][DEBUG ]   "rank": 0,
[2016-10-05 14:48:54,282][ceph01][DEBUG ]   "state": "leader",
[2016-10-05 14:48:54,282][ceph01][DEBUG ]   "sync_provider": []
[2016-10-05 14:48:54,283][ceph01][DEBUG ] }
[2016-10-05 14:48:54,283][ceph01][DEBUG ] ********************************************************************************
[2016-10-05 14:48:54,283][ceph01][INFO ] monitor: mon.ceph01 is running

But the cluster worked just fine until I tried adding two more monitors.

Following the troubleshooting section "Recovering a Monitor’s Broken monmap", I thought I would try extracting a monmap, with the idea that I might learn something or possibly change the fsid on ceph01:

[root@ceph01 ~]# ceph-mon -i mon.ceph01 --extract-monmap /tmp/monmap
monitor data directory at '/var/lib/ceph/mon/ceph-mon.ceph01' does not exist: have you run 'mkfs'?

So that didn't get me anything either.
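Rereading that error, I suspect I simply passed the wrong ID: the data directory on ceph01 is presumably /var/lib/ceph/mon/ceph-ceph01, so the -i argument should probably be the bare ID without the "mon." prefix, with the daemon stopped first. Roughly (untested on my end, paths assumed):

systemctl stop ceph-mon@ceph01
ceph-mon -i ceph01 --extract-monmap /tmp/monmap   # dump the mon's current monmap to /tmp/monmap
monmaptool --print /tmp/monmap                    # show the fsid and mon addresses it contains
systemctl start ceph-mon@ceph01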
The mon log on ceph01 contains repetitions of:

2016-11-01 21:34:33.588396 7ff029c70700 0 mon.ceph01@0(probing) e2 handle_probe ignoring fsid e2e43abc-e634-4a04-ae24-0c486a035b6e != 3e84db5d-3dc8-4104-89e7-da23c103ef50
2016-11-01 21:34:35.739479 7ff029c70700 0 mon.ceph01@0(probing) e2 handle_probe ignoring fsid e2e43abc-e634-4a04-ae24-0c486a035b6e != 3e84db5d-3dc8-4104-89e7-da23c103ef50
2016-11-01 21:34:35.936020 7ff024f3f700 0 -- 10.0.5.2:6789/0 >> 10.0.5.5:0/3093707402 pipe(0x7ff03d57e800 sd=20 :6789 s=0 pgs=0 cs=0 l=0 c=0x7ff03d81e580).accept peer addr is really 10.0.5.5:0/3093707402 (socket is 10.0.5.5:44360/0)
2016-11-01 21:34:37.890073 7ff029c70700 0 mon.ceph01@0(probing) e2 handle_probe ignoring fsid e2e43abc-e634-4a04-ae24-0c486a035b6e != 3e84db5d-3dc8-4104-89e7-da23c103ef50
2016-11-01 21:34:40.043113 7ff029c70700 0 mon.ceph01@0(probing) e2 handle_probe ignoring fsid e2e43abc-e634-4a04-ae24-0c486a035b6e != 3e84db5d-3dc8-4104-89e7-da23c103ef50
2016-11-01 21:34:40.554165 7ff02a471700 0 mon.ceph01@0(probing).data_health(0) update_stats avail 96% total 51175 MB, used 1850 MB, avail 49324 MB

while the mon log on ceph02 contains repetitions of:

2016-11-01 21:34:11.327458 7f33f4284700 0 log_channel(audit) log [DBG] : from='admin socket' entity='admin socket' cmd='mon_status' args=[]: dispatch
2016-11-01 21:34:11.327623 7f33f4284700 0 log_channel(audit) log [DBG] : from='admin socket' entity='admin socket' cmd=mon_status args=[]: finished
2016-11-01 21:34:12.451514 7f33f4284700 0 log_channel(audit) log [DBG] : from='admin socket' entity='admin socket' cmd='mon_status' args=[]: dispatch
2016-11-01 21:34:12.451683 7f33f4284700 0 log_channel(audit) log [DBG] : from='admin socket' entity='admin socket' cmd=mon_status args=[]: finished
2016-11-01 21:34:12.780988 7f33f1715700 0 mon.ceph02@0(probing) e0 handle_probe ignoring fsid 3e84db5d-3dc8-4104-89e7-da23c103ef50 != e2e43abc-e634-4a04-ae24-0c486a035b6e

Any ideas on how to recover from this situation are greatly appreciated!

--
Tracy Reed
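P.S. One idea I'm considering, assuming the conclusion is that ceph02's brand-new mon store simply picked up the wrong fsid from ceph.conf: destroy the (still empty) ceph02 monitor, correct the fsid in ceph.conf to whatever the running cluster actually uses, push the config back out, and re-add the mon. Roughly (the exact ceph-deploy verbs are my guess; I have not run this yet):

ceph-deploy mon destroy ceph02
# correct the fsid line in ceph.conf here, then:
ceph-deploy --overwrite-conf config push ceph01 ceph02
ceph-deploy mon add ceph02

I'd rather hear from someone who has been through this before I start touching fsids, though.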