I bumped debug mon and debug ms up on one of the monitors (ceph0), and this is what I see:
2014-03-25 16:02:19.273406 7fa5c53fc700 5 mon.ceph0@0(electing).elector(35) election timer expired
2014-03-25 16:02:19.273447 7fa5c53fc700 5 mon.ceph0@0(electing).elector(35) start -- can i be leader?
2014-03-25 16:02:19.273528 7fa5c53fc700 1 mon.ceph0@0(electing).elector(35) init, last seen epoch 35
2014-03-25 16:02:19.273543 7fa5c53fc700 1 -- 10.10.30.0:6789/0 --> mon.1 10.10.30.1:6789/0 -- election(b3f38955-4321-4850-9ddb-3b09940dc951 propose 35) v4 -- ?+0 0x251bd80
2014-03-25 16:02:19.273569 7fa5c53fc700 1 -- 10.10.30.0:6789/0 --> mon.2 10.10.30.2:6789/0 -- election(b3f38955-4321-4850-9ddb-3b09940dc951 propose 35) v4 -- ?+0 0x251bb40
2014-03-25 16:02:20.284459 7fa5c4bfb700 1 -- 10.10.30.0:6789/0 <== mon.2 10.10.30.2:6789/0 312 ==== election(b3f38955-4321-4850-9ddb-3b09940dc951 propose 35) v4 ==== 537+0+0 (4036841703 0 0) 0x2515480 con 0x1df0c60
2014-03-25 16:02:20.284524 7fa5c4bfb700 5 mon.ceph0@0(electing).elector(35) handle_propose from mon.2
2014-03-25 16:02:24.273726 7fa5c53fc700 5 mon.ceph0@0(electing).elector(35) election timer expired
That just repeats...2014-03-25 16:02:19.273406 7fa5c53fc700 5 mon.ceph0@0(electing).elector(35) election timer expired
2014-03-25 16:02:19.273447 7fa5c53fc700 5 mon.ceph0@0(electing).elector(35) start -- can i be leader?
2014-03-25 16:02:19.273528 7fa5c53fc700 1 mon.ceph0@0(electing).elector(35) init, last seen epoch 35
2014-03-25 16:02:19.273543 7fa5c53fc700 1 -- 10.10.30.0:6789/0 --> mon.1 10.10.30.1:6789/0 -- election(b3f38955-4321-4850-9ddb-3b09940dc951 propose 35) v4 -- ?+0 0x251bd80
2014-03-25 16:02:19.273569 7fa5c53fc700 1 -- 10.10.30.0:6789/0 --> mon.2 10.10.30.2:6789/0 -- election(b3f38955-4321-4850-9ddb-3b09940dc951 propose 35) v4 -- ?+0 0x251bb40
2014-03-25 16:02:20.284459 7fa5c4bfb700 1 -- 10.10.30.0:6789/0 <== mon.2 10.10.30.2:6789/0 312 ==== election(b3f38955-4321-4850-9ddb-3b09940dc951 propose 35) v4 ==== 537+0+0 (4036841703 0 0) 0x2515480 con 0x1df0c60
2014-03-25 16:02:20.284524 7fa5c4bfb700 5 mon.ceph0@0(electing).elector(35) handle_propose from mon.2
2014-03-25 16:02:24.273726 7fa5c53fc700 5 mon.ceph0@0(electing).elector(35) election timer expired
On Tue, Mar 25, 2014 at 11:48 AM, Travis Rhoden <trhoden@xxxxxxxxx> wrote:
Just to emphasize that I don't think it's clock skew, here is the NTP state of all three monitors:I think they are pretty well in synch.
# ansible ceph_mons -m command -a "ntpq -p" -kK
SSH password:
sudo password [defaults to SSH password]:
ceph0 | success | rc=0 >>
remote refid st t when poll reach delay offset jitter
==============================================================================
*controller-10g 198.60.73.8 2 u 43 64 377 0.236 0.057 0.097
ceph1 | success | rc=0 >>
remote refid st t when poll reach delay offset jitter
==============================================================================
*controller-10g 198.60.73.8 2 u 39 64 377 0.273 0.035 0.064
ceph2 | success | rc=0 >>
remote refid st t when poll reach delay offset jitter
==============================================================================
*controller-10g 198.60.73.8 2 u 30 64 377 0.201 -0.063 0.063
- TravisOn Tue, Mar 25, 2014 at 11:09 AM, Travis Rhoden <trhoden@xxxxxxxxx> wrote:- TravisAny ideas what I can look at besides NTP? The docs really stress that it should be clock skew, so I'll keep looking at that...They all look identical to that.Hello,Specifically, I"m in the stat described in this section: http://ceph.com/docs/master/rados/troubleshooting/troubleshooting-mon/#most-common-monitor-issues
I just deployed a new Emperor cluster using ceph-deploy 1.4. All went very smooth, until I rebooted all the nodes. After reboot, the monitors no longer form a quorum.
I followed the troubleshooting steps here: http://ceph.com/docs/master/rados/troubleshooting/troubleshooting-mon/
The state for all the monitors is "electing". The docs say this is most likely clock skew, but I do have all nodes synch'd with NTP. I've confirmed this multiple times. I've also confirmed the monitors can reach each other (by telneting to IP:PORT, and I can see established connections via netstat).
I'm baffled.
here is a sample mon_status output:
root@ceph0:~# ceph daemon mon.ceph0 quorum_status
{ "election_epoch": 31,
"quorum": [],
"quorum_names": [],
"quorum_leader_name": "",
"monmap": { "epoch": 2,
"fsid": "XXX", (redacted)
"modified": "2014-03-24 14:35:22.332646",
"created": "0.000000",
"mons": [
{ "rank": 0,
"name": "ceph0",
"addr": "10.10.30.0:6789\/0"},
{ "rank": 1,
"name": "ceph1",
"addr": "10.10.30.1:6789\/0"},
{ "rank": 2,
"name": "ceph2",
"addr": "10.10.30.2:6789\/0"}]}}
_______________________________________________ ceph-users mailing list ceph-users@xxxxxxxxxxxxxx http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com