Monitors stuck in "electing"

Travis Rhoden <trhoden@xxxxxxxxx> · Tue, 25 Mar 2014 11:09:13 -0400

Hello,

I just deployed a new Emperor cluster using ceph-deploy 1.4.  Everything went very smoothly until I rebooted all the nodes.  After the reboot, the monitors no longer form a quorum.

I followed the troubleshooting steps here: http://ceph.com/docs/master/rados/troubleshooting/troubleshooting-mon/

Specifically, I'm in the state described in this section: http://ceph.com/docs/master/rados/troubleshooting/troubleshooting-mon/#most-common-monitor-issues

The state for all the monitors is "electing".  The docs say this is most likely clock skew, but I do have all nodes synced with NTP.  I've confirmed this multiple times.  I've also confirmed the monitors can reach each other (by telnetting to IP:PORT, and I can see established connections via netstat).
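
For anyone repeating the reachability check described above, here is a minimal Python sketch that does the same thing as telnetting to each IP:PORT. The addresses are the ones from the monmap in this message; the helper names `can_connect` and `check_monitors` are made up for illustration and are not part of Ceph. It only confirms that a TCP connection can be opened, nothing more.

```python
import socket

# Monitor addresses copied from the monmap in this message.
MON_ADDRS = [("10.10.30.0", 6789), ("10.10.30.1", 6789), ("10.10.30.2", 6789)]


def can_connect(host, port, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, unreachable networks, and timeouts.
        return False


def check_monitors(addrs, timeout=1.0):
    """Map "host:port" to a reachability flag for each address."""
    return {f"{h}:{p}": can_connect(h, p, timeout) for h, p in addrs}


if __name__ == "__main__":
    for addr, ok in check_monitors(MON_ADDRS).items():
        print(addr, "reachable" if ok else "UNREACHABLE")
```

Running it from each monitor node in turn checks every direction, which matters if a firewall rule is applied on only some hosts.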

I'm baffled.

Here is sample quorum_status output:

root@ceph0:~# ceph daemon mon.ceph0 quorum_status
{ "election_epoch": 31,
  "quorum": [],
  "quorum_names": [],
  "quorum_leader_name": "",
  "monmap": { "epoch": 2,
      "fsid": "XXX", (redacted)
      "modified": "2014-03-24 14:35:22.332646",
      "created": "0.000000",
      "mons": [
            { "rank": 0,
              "name": "ceph0",
              "addr": "10.10.30.0:6789\/0"},
            { "rank": 1,
              "name": "ceph1",
              "addr": "10.10.30.1:6789\/0"},
            { "rank": 2,
              "name": "ceph2",
              "addr": "10.10.30.2:6789\/0"}]}}

They all look identical to that.
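
To compare that output across monitors without eyeballing it, a small Python sketch along these lines could reduce each quorum_status dump to the fields that matter. The `SAMPLE` text is abbreviated from the output above (the fsid is a placeholder and the "(redacted)" note is dropped so it parses as JSON), and `summarize` is a hypothetical helper, not a Ceph tool.

```python
import json

# Abbreviated sample modeled on the quorum_status output above;
# the fsid value is a placeholder, not a real cluster ID.
SAMPLE = r"""
{ "election_epoch": 31,
  "quorum": [],
  "quorum_names": [],
  "quorum_leader_name": "",
  "monmap": { "epoch": 2,
      "fsid": "XXX",
      "mons": [
            { "rank": 0, "name": "ceph0", "addr": "10.10.30.0:6789\/0"},
            { "rank": 1, "name": "ceph1", "addr": "10.10.30.1:6789\/0"},
            { "rank": 2, "name": "ceph2", "addr": "10.10.30.2:6789\/0"}]}}
"""


def summarize(status):
    """Reduce a parsed quorum_status dict to a few comparable fields."""
    mons = status["monmap"]["mons"]
    in_quorum = set(status["quorum_names"])
    return {
        "election_epoch": status["election_epoch"],
        "monmap_epoch": status["monmap"]["epoch"],
        "has_quorum": bool(status["quorum"]),
        # Monitors listed in the monmap but absent from the quorum.
        "missing": [m["name"] for m in mons if m["name"] not in in_quorum],
        "addrs": {m["name"]: m["addr"] for m in mons},
    }


if __name__ == "__main__":
    print(summarize(json.loads(SAMPLE)))
```

If the summaries from ceph0, ceph1, and ceph2 disagree on the monmap epoch or on any address, that would be something to look at besides NTP.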

Any ideas what I can look at besides NTP?  The docs really stress that it should be clock skew, so I'll keep looking at that...

 - Travis

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com