Hi,

Thanks for the advice.  I feel pretty dumb, as it does indeed look like a
simple networking issue.
You know how you check things 5 times and miss the most obvious one... :)

On 17 September 2014 16:04, Florian Haas <florian at hastexo.com> wrote:

> On Wed, Sep 17, 2014 at 1:58 PM, James Eckersall
> <james.eckersall at gmail.com> wrote:
> > Hi,
> >
> > I have a ceph cluster running 0.80.1 on Ubuntu 14.04.  I have 3 monitors
> > and 4 OSD nodes currently.
> >
> > Everything has been running great up until today, when I hit an issue
> > with the monitors.
> > I moved mon03 to a different switch port, so it would have temporarily
> > lost connectivity.
> > Since then, the cluster reports that mon as down, although it's
> > definitely up.
> > I've tried restarting the mon services on all three mons, but that
> > hasn't made a difference.
> > I definitely, 100% do not have any clock skew on any of the mons.  This
> > has been triple-checked, as the ceph docs suggest it might be the cause
> > of this issue.
> >
> > Here is what ceph -s and ceph health detail report, as well as the
> > mon_status for each monitor:
> >
> >
> > # ceph -s ; ceph health detail
> >     cluster XXX
> >      health HEALTH_WARN 1 mons down, quorum 0,1 ceph-mon-01,ceph-mon-02
> >      monmap e2: 3 mons at
> > {ceph-mon-01=10.1.1.64:6789/0,ceph-mon-02=10.1.1.65:6789/0,ceph-mon-03=10.1.1.66:6789/0},
> > election epoch 932, quorum 0,1 ceph-mon-01,ceph-mon-02
> >      osdmap e49213: 80 osds: 80 up, 80 in
> >       pgmap v18242952: 4864 pgs, 5 pools, 69910 GB data, 17638 kobjects
> >             197 TB used, 95904 GB / 290 TB avail
> >                    8 active+clean+scrubbing+deep
> >                 4856 active+clean
> >   client io 6893 kB/s rd, 5657 kB/s wr, 2090 op/s
> > HEALTH_WARN 1 mons down, quorum 0,1 ceph-mon-01,ceph-mon-02
> > mon.ceph-mon-03 (rank 2) addr 10.1.1.66:6789/0 is down (out of quorum)
> >
> >
> > { "name": "ceph-mon-01",
> >   "rank": 0,
> >   "state": "leader",
> >   "election_epoch": 932,
> >   "quorum": [
> >         0,
> >         1],
> >   "outside_quorum": [],
> >   "extra_probe_peers": [],
> >   "sync_provider": [],
> >   "monmap": { "epoch": 2,
> >       "fsid": "XXX",
> >       "modified": "0.000000",
> >       "created": "0.000000",
> >       "mons": [
> >             { "rank": 0,
> >               "name": "ceph-mon-01",
> >               "addr": "10.1.1.64:6789\/0"},
> >             { "rank": 1,
> >               "name": "ceph-mon-02",
> >               "addr": "10.1.1.65:6789\/0"},
> >             { "rank": 2,
> >               "name": "ceph-mon-03",
> >               "addr": "10.1.1.66:6789\/0"}]}}
> >
> >
> > { "name": "ceph-mon-02",
> >   "rank": 1,
> >   "state": "peon",
> >   "election_epoch": 932,
> >   "quorum": [
> >         0,
> >         1],
> >   "outside_quorum": [],
> >   "extra_probe_peers": [],
> >   "sync_provider": [],
> >   "monmap": { "epoch": 2,
> >       "fsid": "XXX",
> >       "modified": "0.000000",
> >       "created": "0.000000",
> >       "mons": [
> >             { "rank": 0,
> >               "name": "ceph-mon-01",
> >               "addr": "10.1.1.64:6789\/0"},
> >             { "rank": 1,
> >               "name": "ceph-mon-02",
> >               "addr": "10.1.1.65:6789\/0"},
> >             { "rank": 2,
> >               "name": "ceph-mon-03",
> >               "addr": "10.1.1.66:6789\/0"}]}}
> >
> >
> > { "name": "ceph-mon-03",
> >   "rank": 2,
> >   "state": "electing",
> >   "election_epoch": 931,
> >   "quorum": [],
> >   "outside_quorum": [],
> >   "extra_probe_peers": [],
> >   "sync_provider": [],
> >   "monmap": { "epoch": 2,
> >       "fsid": "XXX",
> >       "modified": "0.000000",
> >       "created": "0.000000",
> >       "mons": [
> >             { "rank": 0,
> >               "name": "ceph-mon-01",
> >               "addr": "10.1.1.64:6789\/0"},
> >             { "rank": 1,
> >               "name": "ceph-mon-02",
> >               "addr": "10.1.1.65:6789\/0"},
> >             { "rank": 2,
> >               "name": "ceph-mon-03",
> >               "addr": "10.1.1.66:6789\/0"}]}}
> >
> > Any help or advice is appreciated.
>
> It looks like your mon has been unable to communicate with the other
> hosts, presumably since the time you un-/replugged it. Check your
> switch port configuration. Also, make sure that from 10.1.1.66 you can
> not only ping 10.1.1.64 and 10.1.1.65, but also make a TCP connection
> on port 6789. With that out of the way, check your mon log on
> ceph-mon-03 (in /var/log/ceph/mon); it should provide some additional
> insight into the problem.
>
> Cheers,
> Florian
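
For anyone hitting the same symptom, the checks Florian suggests could look
roughly like the sketch below, run from ceph-mon-03 (10.1.1.66).  This is only
a sketch, not taken from the thread: nc (or telnet) is assumed to be
installed, and the exact mon log file name is an assumption; on Ubuntu it is
typically /var/log/ceph/ceph-mon.<mon-name>.log rather than /var/log/ceph/mon.

    # Basic reachability from ceph-mon-03 to the other two monitors
    ping -c 3 10.1.1.64
    ping -c 3 10.1.1.65

    # Confirm a TCP connection to the monitor port actually succeeds
    # (telnet 10.1.1.64 6789 works too if nc is not installed)
    nc -zv 10.1.1.64 6789
    nc -zv 10.1.1.65 6789

    # Then check the local mon log for probe/election errors
    # (log path assumed; adjust to your cluster's mon name)
    tail -n 100 /var/log/ceph/ceph-mon.ceph-mon-03.log

If ping succeeds but the TCP check fails, a switch port setting (VLAN, ACL) or
a host firewall rule is the usual culprit, which fits the simple networking
issue found here.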