Just to emphasize that I don't think it's clock skew, here is the NTP state of all three monitors:
# ansible ceph_mons -m command -a "ntpq -p" -kK
SSH password:
sudo password [defaults to SSH password]:
ceph0 | success | rc=0 >>
remote refid st t when poll reach delay offset jitter
==============================================================================
*controller-10g 198.60.73.8 2 u 43 64 377 0.236 0.057 0.097
ceph1 | success | rc=0 >>
remote refid st t when poll reach delay offset jitter
==============================================================================
*controller-10g 198.60.73.8 2 u 39 64 377 0.273 0.035 0.064
ceph2 | success | rc=0 >>
remote refid st t when poll reach delay offset jitter
==============================================================================
*controller-10g 198.60.73.8 2 u 30 64 377 0.201 -0.063 0.063
I think they are pretty well in synch.
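For anyone who wants to run the same check across many hosts, here is a minimal Python sketch that pulls the offset column out of `ntpq -p` output and compares it with an allowed drift. The function names are hypothetical. The 50 ms limit is an assumption based on my understanding that `mon clock drift allowed` defaults to 0.05 s, so check your own configuration. The sketch also assumes raw `ntpq -p` output, where the first character of each peer row is the tally code and the offset is reported in milliseconds.

```python
# Sketch: extract per-peer offsets (in ms) from `ntpq -p` output and compare
# them with an assumed allowed clock drift.

ASSUMED_MAX_DRIFT_MS = 50.0  # assumption: 0.05 s; verify against your ceph.conf


def peer_offsets_ms(ntpq_output):
    """Return {remote: offset_ms} for every peer row found in the output."""
    offsets = {}
    for line in ntpq_output.splitlines():
        # Column 0 of a peer row is the tally code (*, +, -, space, ...).
        fields = line[1:].split()
        # Peer rows have 10 columns:
        # remote refid st t when poll reach delay offset jitter
        if len(fields) != 10 or fields[0] == "remote":
            continue
        try:
            offsets[fields[0]] = float(fields[8])
        except ValueError:
            # Not a peer row (for example a header that lost its leading space).
            continue
    return offsets


def within_allowed_drift(offsets, limit_ms=ASSUMED_MAX_DRIFT_MS):
    """True if every offset is within +/- limit_ms."""
    return all(abs(value) <= limit_ms for value in offsets.values())


sample = """\
ceph0 | success | rc=0 >>
     remote           refid      st t when poll reach   delay   offset  jitter
==============================================================================
*controller-10g  198.60.73.8      2 u   43   64  377    0.236    0.057   0.097
"""
offsets = peer_offsets_ms(sample)
print(offsets, within_allowed_drift(offsets))
```

The offsets shown above are all well under a tenth of a millisecond, so they would pass this check by a wide margin under that assumed limit.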
- Travis
On Tue, Mar 25, 2014 at 11:09 AM, Travis Rhoden <trhoden@xxxxxxxxx> wrote:
Hello,
I just deployed a new Emperor cluster using ceph-deploy 1.4. Everything went smoothly until I rebooted all the nodes. After the reboot, the monitors no longer form a quorum.
I followed the troubleshooting steps here: http://ceph.com/docs/master/rados/troubleshooting/troubleshooting-mon/
Specifically, I'm in the state described in this section: http://ceph.com/docs/master/rados/troubleshooting/troubleshooting-mon/#most-common-monitor-issues
The state for all the monitors is "electing". The docs say this is most likely clock skew, but I do have all nodes synch'd with NTP, and I've confirmed this multiple times. I've also confirmed that the monitors can reach each other (by telnetting to IP:PORT, and I can see established connections via netstat).
I'm baffled.
Here is a sample quorum_status output:
root@ceph0:~# ceph daemon mon.ceph0 quorum_status
{ "election_epoch": 31,
"quorum": [],
"quorum_names": [],
"quorum_leader_name": "",
"monmap": { "epoch": 2,
"fsid": "XXX", (redacted)
"modified": "2014-03-24 14:35:22.332646",
"created": "0.000000",
"mons": [
{ "rank": 0,
"name": "ceph0",
"addr": "10.10.30.0:6789\/0"},
{ "rank": 1,
"name": "ceph1",
"addr": "10.10.30.1:6789\/0"},
{ "rank": 2,
"name": "ceph2",
"addr": "10.10.30.2:6789\/0"}]}}
They all look identical to that.
Any ideas what I can look at besides NTP? The docs really stress that it should be clock skew, so I'll keep looking at that...
- Travis
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
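The two manual checks described in the original message (reading the quorum_status output and telnetting to each monitor's IP:PORT) can be sketched in Python as follows. This is only an illustration under stated assumptions: the function names are hypothetical, the sample JSON is a trimmed copy of the output quoted in the thread (the redacted fsid line is dropped so that it parses), and the address parsing assumes IPv4 addresses of the form `host:port/nonce`.

```python
import json
import socket

# Trimmed copy of the quorum_status output from the thread (fsid omitted).
SAMPLE = r'''
{ "election_epoch": 31,
  "quorum": [],
  "quorum_names": [],
  "quorum_leader_name": "",
  "monmap": { "epoch": 2,
      "mons": [
        { "rank": 0, "name": "ceph0", "addr": "10.10.30.0:6789\/0"},
        { "rank": 1, "name": "ceph1", "addr": "10.10.30.1:6789\/0"},
        { "rank": 2, "name": "ceph2", "addr": "10.10.30.2:6789\/0"}]}}
'''


def summarize_quorum(status):
    """Return the fields most relevant to a stuck election."""
    return {
        "election_epoch": status["election_epoch"],
        "has_quorum": bool(status["quorum"]),
        "leader": status["quorum_leader_name"] or None,
    }


def mon_endpoints(status):
    """Return (name, host, port) for each monitor listed in the monmap."""
    endpoints = []
    for mon in status["monmap"]["mons"]:
        # addr looks like "10.10.30.0:6789/0"; drop the trailing "/nonce".
        hostport = mon["addr"].split("/", 1)[0]
        host, port = hostport.rsplit(":", 1)
        endpoints.append((mon["name"], host, int(port)))
    return endpoints


def can_connect(host, port, timeout=2.0):
    """True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


status = json.loads(SAMPLE)
print(summarize_quorum(status))
print(mon_endpoints(status))
```

Calling `can_connect(host, port)` for each tuple from `mon_endpoints(status)` on every monitor host gives the same information as the telnet test. A successful TCP connection only shows that the port is reachable, not that the monitors are exchanging election messages correctly.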