monitor quorum

florian@xxxxxxxxxxx (Florian Haas) · Wed, 17 Sep 2014 17:04:25 +0200

On Wed, Sep 17, 2014 at 1:58 PM, James Eckersall
<james.eckersall at gmail.com> wrote:
> Hi,
>
> I have a ceph cluster running 0.80.1 on Ubuntu 14.04.  I have 3 monitors and
> 4 OSD nodes currently.
>
> Everything had been running great up until today, when I hit an issue
> with the monitors.
> I moved mon03 to a different switch port, so it temporarily lost
> connectivity.
> Since then, the cluster has been reporting that mon as down, although it's
> definitely up.
> I've tried restarting the mon services on all three mons, but that hasn't
> made a difference.
> I definitely, 100% do not have any clock skew on any of the mons.  This has
> been triple-checked as the ceph docs seem to suggest that might be the cause
> of this issue.
>
> Here is what ceph -s and ceph health detail are reporting as well as the
> mon_status for each monitor:
>
>
> # ceph -s ; ceph health detail
>     cluster XXX
>      health HEALTH_WARN 1 mons down, quorum 0,1 ceph-mon-01,ceph-mon-02
>      monmap e2: 3 mons at
> {ceph-mon-01=10.1.1.64:6789/0,ceph-mon-02=10.1.1.65:6789/0,ceph-mon-03=10.1.1.66:6789/0},
> election epoch 932, quorum 0,1 ceph-mon-01,ceph-mon-02
>      osdmap e49213: 80 osds: 80 up, 80 in
>       pgmap v18242952: 4864 pgs, 5 pools, 69910 GB data, 17638 kobjects
>             197 TB used, 95904 GB / 290 TB avail
>                    8 active+clean+scrubbing+deep
>                 4856 active+clean
>   client io 6893 kB/s rd, 5657 kB/s wr, 2090 op/s
> HEALTH_WARN 1 mons down, quorum 0,1 ceph-mon-01,ceph-mon-02
> mon.ceph-mon-03 (rank 2) addr 10.1.1.66:6789/0 is down (out of quorum)
>
>
> { "name": "ceph-mon-01",
>   "rank": 0,
>   "state": "leader",
>   "election_epoch": 932,
>   "quorum": [
>         0,
>         1],
>   "outside_quorum": [],
>   "extra_probe_peers": [],
>   "sync_provider": [],
>   "monmap": { "epoch": 2,
>       "fsid": "XXX",
>       "modified": "0.000000",
>       "created": "0.000000",
>       "mons": [
>             { "rank": 0,
>               "name": "ceph-mon-01",
>               "addr": "10.1.1.64:6789\/0"},
>             { "rank": 1,
>               "name": "ceph-mon-02",
>               "addr": "10.1.1.65:6789\/0"},
>             { "rank": 2,
>               "name": "ceph-mon-03",
>               "addr": "10.1.1.66:6789\/0"}]}}
>
>
> { "name": "ceph-mon-02",
>   "rank": 1,
>   "state": "peon",
>   "election_epoch": 932,
>   "quorum": [
>         0,
>         1],
>   "outside_quorum": [],
>   "extra_probe_peers": [],
>   "sync_provider": [],
>   "monmap": { "epoch": 2,
>       "fsid": "XXX",
>       "modified": "0.000000",
>       "created": "0.000000",
>       "mons": [
>             { "rank": 0,
>               "name": "ceph-mon-01",
>               "addr": "10.1.1.64:6789\/0"},
>             { "rank": 1,
>               "name": "ceph-mon-02",
>               "addr": "10.1.1.65:6789\/0"},
>             { "rank": 2,
>               "name": "ceph-mon-03",
>               "addr": "10.1.1.66:6789\/0"}]}}
>
>
> { "name": "ceph-mon-03",
>   "rank": 2,
>   "state": "electing",
>   "election_epoch": 931,
>   "quorum": [],
>   "outside_quorum": [],
>   "extra_probe_peers": [],
>   "sync_provider": [],
>   "monmap": { "epoch": 2,
>       "fsid": "XXX",
>       "modified": "0.000000",
>       "created": "0.000000",
>       "mons": [
>             { "rank": 0,
>               "name": "ceph-mon-01",
>               "addr": "10.1.1.64:6789\/0"},
>             { "rank": 1,
>               "name": "ceph-mon-02",
>               "addr": "10.1.1.65:6789\/0"},
>             { "rank": 2,
>               "name": "ceph-mon-03",
>               "addr": "10.1.1.66:6789\/0"}]}}
>
>
> Any help or advice is appreciated.

It looks like ceph-mon-03 has been unable to communicate with the other
mons, presumably since you unplugged and replugged it: it is stuck in
"electing" at election epoch 931, while the other two have formed a
quorum at epoch 932. Check your switch port configuration. Also, make
sure that from 10.1.1.66 you can not only ping 10.1.1.64 and 10.1.1.65,
but also open a TCP connection to each of them on port 6789. With that
out of the way, check the mon log on ceph-mon-03 (under /var/log/ceph/,
by default named ceph-mon.ceph-mon-03.log); it should provide some
additional insight into the problem.
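
If you want to script the TCP check rather than do it by hand, something
along these lines should work. This is only a minimal sketch, assuming
Python 3 is available on ceph-mon-03; the helper names can_connect and
check_peers are made up for illustration and are not part of Ceph. The
addresses and port are the ones from your monmap.

```python
import socket

MON_PORT = 6789  # monitor port as listed in the monmap


def can_connect(host, port=MON_PORT, timeout=3.0):
    """Return True if a TCP connection to host:port can be opened."""
    try:
        # create_connection performs the full TCP handshake, which a
        # plain ping (ICMP) does not exercise.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


def check_peers(peers, port=MON_PORT, timeout=3.0):
    """Map each peer name to whether its mon port is reachable."""
    return {name: can_connect(addr, port, timeout)
            for name, addr in peers.items()}


# Example (hypothetical usage, run on ceph-mon-03, i.e. 10.1.1.66):
#   peers = {"ceph-mon-01": "10.1.1.64", "ceph-mon-02": "10.1.1.65"}
#   for name, ok in check_peers(peers).items():
#       print(name, "reachable" if ok else "NOT reachable")
```

A successful connection only tells you that the port is reachable, not
that the mons are talking to each other properly, so the mon log is still
worth reading. It is also worth running the same check in the other
direction, from the two healthy mons towards 10.1.1.66.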

Cheers,
Florian