On Wed, Sep 17, 2014 at 1:58 PM, James Eckersall <james.eckersall at gmail.com> wrote:
> Hi,
>
> I have a ceph cluster running 0.80.1 on Ubuntu 14.04. I have 3 monitors and
> 4 OSD nodes currently.
>
> Everything has been running great up until today where I've got an issue
> with the monitors.
> I moved mon03 to a different switchport so it would have temporarily lost
> connectivity.
> Since then, the cluster is reporting that that mon is down, although it's
> definitely up.
> I've tried restarting the mon services on all three mons, but that hasn't
> made a difference.
> I definitely, 100% do not have any clock skew on any of the mons. This has
> been triple-checked as the ceph docs seem to suggest that might be the cause
> of this issue.
>
> Here is what ceph -s and ceph health detail are reporting as well as the
> mon_status for each monitor:
>
>
> # ceph -s ; ceph health detail
>     cluster XXX
>      health HEALTH_WARN 1 mons down, quorum 0,1 ceph-mon-01,ceph-mon-02
>      monmap e2: 3 mons at
> {ceph-mon-01=10.1.1.64:6789/0,ceph-mon-02=10.1.1.65:6789/0,ceph-mon-03=10.1.1.66:6789/0},
> election epoch 932, quorum 0,1 ceph-mon-01,ceph-mon-02
>      osdmap e49213: 80 osds: 80 up, 80 in
>       pgmap v18242952: 4864 pgs, 5 pools, 69910 GB data, 17638 kobjects
>             197 TB used, 95904 GB / 290 TB avail
>                    8 active+clean+scrubbing+deep
>                 4856 active+clean
>   client io 6893 kB/s rd, 5657 kB/s wr, 2090 op/s
> HEALTH_WARN 1 mons down, quorum 0,1 ceph-mon-01,ceph-mon-02
> mon.ceph-mon-03 (rank 2) addr 10.1.1.66:6789/0 is down (out of quorum)
>
>
> { "name": "ceph-mon-01",
>   "rank": 0,
>   "state": "leader",
>   "election_epoch": 932,
>   "quorum": [
>         0,
>         1],
>   "outside_quorum": [],
>   "extra_probe_peers": [],
>   "sync_provider": [],
>   "monmap": { "epoch": 2,
>       "fsid": "XXX",
>       "modified": "0.000000",
>       "created": "0.000000",
>       "mons": [
>             { "rank": 0,
>               "name": "ceph-mon-01",
>               "addr": "10.1.1.64:6789\/0"},
>             { "rank": 1,
>               "name": "ceph-mon-02",
>               "addr": "10.1.1.65:6789\/0"},
>             { "rank": 2,
>               "name": "ceph-mon-03",
>               "addr": "10.1.1.66:6789\/0"}]}}
>
>
> { "name": "ceph-mon-02",
>   "rank": 1,
>   "state": "peon",
>   "election_epoch": 932,
>   "quorum": [
>         0,
>         1],
>   "outside_quorum": [],
>   "extra_probe_peers": [],
>   "sync_provider": [],
>   "monmap": { "epoch": 2,
>       "fsid": "XXX",
>       "modified": "0.000000",
>       "created": "0.000000",
>       "mons": [
>             { "rank": 0,
>               "name": "ceph-mon-01",
>               "addr": "10.1.1.64:6789\/0"},
>             { "rank": 1,
>               "name": "ceph-mon-02",
>               "addr": "10.1.1.65:6789\/0"},
>             { "rank": 2,
>               "name": "ceph-mon-03",
>               "addr": "10.1.1.66:6789\/0"}]}}
>
>
> { "name": "ceph-mon-03",
>   "rank": 2,
>   "state": "electing",
>   "election_epoch": 931,
>   "quorum": [],
>   "outside_quorum": [],
>   "extra_probe_peers": [],
>   "sync_provider": [],
>   "monmap": { "epoch": 2,
>       "fsid": "XXX",
>       "modified": "0.000000",
>       "created": "0.000000",
>       "mons": [
>             { "rank": 0,
>               "name": "ceph-mon-01",
>               "addr": "10.1.1.64:6789\/0"},
>             { "rank": 1,
>               "name": "ceph-mon-02",
>               "addr": "10.1.1.65:6789\/0"},
>             { "rank": 2,
>               "name": "ceph-mon-03",
>               "addr": "10.1.1.66:6789\/0"}]}}
>
>
> Any help or advice is appreciated.

It looks like your mon has been unable to communicate with the other
hosts, presumably since the time you un-/replugged it. Check your
switch port configuration. Also, make sure that from 10.1.1.66, you
can not only ping 10.1.1.64 and 10.1.1.65, but make a TCP connection
on port 6789.

With that out of the way, check your mon log on ceph-mon-03 (in
/var/log/ceph/mon); it should provide some additional insight into the
problem.

Cheers,
Florian
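
If it helps, the TCP check can be scripted rather than done by hand. What follows is only a minimal sketch using Python's standard `socket` module; the helper names `can_connect` and `check_peers` are made up for illustration and are not part of any Ceph tooling. It only tells you whether a TCP handshake to the mon port completes, not whether the mon protocol itself is healthy.

```python
import socket


def can_connect(host, port=6789, timeout=3.0):
    """Return True if a TCP connection to host:port completes within timeout.

    Hypothetical helper for illustration; 6789 is the mon port shown in
    the monmap quoted above.
    """
    try:
        # create_connection performs the full TCP handshake, which is
        # exactly what a successful ping does not prove.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


def check_peers(peers, port=6789, timeout=3.0):
    """Map each peer address to the result of can_connect()."""
    return {host: can_connect(host, port, timeout) for host in peers}
```

Run on ceph-mon-03, `check_peers(["10.1.1.64", "10.1.1.65"])` should return `True` for both addresses if the new switch port passes traffic to the other mons. A `False` for either one would point back at the switch port or a firewall rather than at Ceph.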