Re: 1 mon unable to join the quorum

The cluster was initially deployed with ceph-ansible on the Infernalis release. For some unknown reason controller02 dropped out of the quorum and we were unable to add it back.

We then upgraded the cluster to Jewel using the rolling-update playbook from ceph-ansible.

controller02 was still not in the quorum.

We tried deleting the mon completely and adding it again using the manual method described at http://docs.ceph.com/docs/jewel/rados/operations/add-or-rm-mons/ (with id controller02).

The logs provided were captured when controller02 was added with the manual method.

But controller02 still won't join the cluster.

Hope this helps you understand.


On 31/03/2018 02:12, Brad Hubbard wrote:
I'm not sure I completely understand your "test". What exactly are you
trying to achieve and what documentation are you following?

On Fri, Mar 30, 2018 at 10:49 PM, Julien Lavesque
<julien.lavesque@xxxxxxxxxxxxxxxxxx> wrote:
Brad,

Thanks for your answer

On 30/03/2018 02:09, Brad Hubbard wrote:

2018-03-19 11:03:50.819493 7f842ed47640  0 mon.controller02 does not exist in monmap, will attempt to join an existing cluster
2018-03-19 11:03:50.820323 7f842ed47640  0 starting mon.controller02 rank -1 at 172.18.8.6:6789/0 mon_data /var/lib/ceph/mon/ceph-controller02 fsid f37f31b1-92c5-47c8-9834-1757a677d020

We are called 'mon.controller02' and we can not find our name in the
local copy of the monmap.

2018-03-19 11:03:52.346318 7f842735d700 10 mon.controller02@-1(probing) e68  ready to join, but i'm not in the monmap or my addr is blank, trying to join

Our name is not in the copy of the monmap we got from peer controller01
either.


During our test we completely deleted the controller02 monitor and added
it again.

The log you have is from when controller02 was being added (so it wasn't in
the monmap before).



$ cat ../controller02-mon_status.log
[root@controller02 ~]# ceph --admin-daemon /var/run/ceph/ceph-mon.controller02.asok mon_status
{
    "name": "controller02",
    "rank": 1,
    "state": "electing",
    "election_epoch": 32749,
    "quorum": [],
    "outside_quorum": [],
    "extra_probe_peers": [],
    "sync_provider": [],
    "monmap": {
        "epoch": 71,
        "fsid": "f37f31b1-92c5-47c8-9834-1757a677d020",
        "modified": "2018-03-29 10:48:06.371157",
        "created": "0.000000",
        "mons": [
            {
                "rank": 0,
                "name": "controller01",
                "addr": "172.18.8.5:6789\/0"
            },
            {
                "rank": 1,
                "name": "controller02",
                "addr": "172.18.8.6:6789\/0"
            },
            {
                "rank": 2,
                "name": "controller03",
                "addr": "172.18.8.7:6789\/0"
            }
        ]
    }
}

In the monmaps we are called 'controller02', not 'mon.controller02'.
These names need to be identical.
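
The name comparison described above can be checked mechanically. The following is only a minimal sketch, assuming the `mon_status` output has been saved as JSON text; the helper names `monmap_names` and `in_monmap` and the `SAMPLE` string are illustrative and not part of Ceph. The sample is a trimmed copy of the dump quoted above.

```python
import json


def monmap_names(mon_status_text):
    """Return the monitor names listed in the monmap of a mon_status dump."""
    status = json.loads(mon_status_text)
    return [mon["name"] for mon in status["monmap"]["mons"]]


def in_monmap(mon_status_text, name):
    """True if `name` matches a monmap entry exactly (no prefix handling)."""
    return name in monmap_names(mon_status_text)


# Trimmed copy of the mon_status output from controller02 (hypothetical variable name).
SAMPLE = r'''
{
    "name": "controller02",
    "rank": 1,
    "monmap": {
        "epoch": 71,
        "mons": [
            {"rank": 0, "name": "controller01", "addr": "172.18.8.5:6789\/0"},
            {"rank": 1, "name": "controller02", "addr": "172.18.8.6:6789\/0"},
            {"rank": 2, "name": "controller03", "addr": "172.18.8.7:6789\/0"}
        ]
    }
}
'''

if __name__ == "__main__":
    print(in_monmap(SAMPLE, "controller02"))      # True
    print(in_monmap(SAMPLE, "mon.controller02"))  # False
```

The comparison is deliberately an exact string match, so a lookup for 'mon.controller02' fails against a monmap that stores 'controller02'.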


The cluster was deployed using ceph-ansible with the servers' hostnames.
All monitors are called mon.controller0x in the monmap, and all 3
monitors have the same configuration.

We see the same behavior when creating a monmap from scratch:

[root@controller03 ~]# monmaptool --create --add controller01 172.18.8.5:6789 --add controller02 172.18.8.6:6789 --add controller03 172.18.8.7:6789 --fsid f37f31b1-92c5-47c8-9834-1757a677d020 --clobber test-monmap
monmaptool: monmap file test-monmap
monmaptool: set fsid to f37f31b1-92c5-47c8-9834-1757a677d020
monmaptool: writing epoch 0 to test-monmap (3 monitors)

[root@controller03 ~]# monmaptool --print test-monmap
monmaptool: monmap file test-monmap
epoch 0
fsid f37f31b1-92c5-47c8-9834-1757a677d020
last_changed 2018-03-30 14:42:18.809719
created 2018-03-30 14:42:18.809719
0: 172.18.8.5:6789/0 mon.controller01
1: 172.18.8.6:6789/0 mon.controller02
2: 172.18.8.7:6789/0 mon.controller03
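
To compare this listing with the names in the mon_status JSON, the printed lines can be parsed. This is a rough sketch only. It assumes that the `mon.` seen in the `--print` output is a display prefix added in front of the id passed to `--add` (the command above used `--add controller01`, yet the listing shows `mon.controller01`). The function name `parse_monmap_print` and the regular expression are illustrative, not part of any Ceph tooling.

```python
import re

# Matches lines such as "1: 172.18.8.6:6789/0 mon.controller02".
MON_LINE = re.compile(r"^(\d+):\s+(\S+)\s+mon\.(\S+)$")


def parse_monmap_print(text):
    """Map rank -> (addr, bare name) for each monitor line in the listing."""
    mons = {}
    for line in text.splitlines():
        match = MON_LINE.match(line.strip())
        if match:
            rank, addr, name = match.groups()
            mons[int(rank)] = (addr, name)
    return mons


# Copy of the `monmaptool --print` output quoted above (hypothetical variable name).
SAMPLE_PRINT = """\
monmaptool: monmap file test-monmap
epoch 0
fsid f37f31b1-92c5-47c8-9834-1757a677d020
last_changed 2018-03-30 14:42:18.809719
created 2018-03-30 14:42:18.809719
0: 172.18.8.5:6789/0 mon.controller01
1: 172.18.8.6:6789/0 mon.controller02
2: 172.18.8.7:6789/0 mon.controller03
"""

if __name__ == "__main__":
    for rank, (addr, name) in sorted(parse_monmap_print(SAMPLE_PRINT).items()):
        print(rank, name, addr)
```

Under that assumption, the bare names recovered here (controller01, controller02, controller03) are the ones to compare against the "name" fields in the mon_status output.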



On Thu, Mar 29, 2018 at 7:23 PM, Julien Lavesque
<julien.lavesque@xxxxxxxxxxxxxxxxxx> wrote:

Hi Brad,

The results have been uploaded to the tracker
(https://tracker.ceph.com/issues/23403).

Julien


On 29/03/2018 07:54, Brad Hubbard wrote:


Can you update with the result of the following commands from all of the
MONs?

# ceph --admin-daemon /var/run/ceph/ceph-mon.[whatever].asok mon_status
# ceph --admin-daemon /var/run/ceph/ceph-mon.[whatever].asok quorum_status

On Thu, Mar 29, 2018 at 3:11 PM, Gauvain Pocentek
<gauvain.pocentek@xxxxxxxxxxxxxxxxxx> wrote:


Hello Ceph users,

We are having a problem on a ceph cluster running Jewel: one of the mons
left the quorum, and we have not been able to make it join again. The two
other monitors are running just fine, but obviously we need this third one.

The problem happened before Jewel, when the cluster was running Infernalis.
We upgraded hoping that it would solve the problem, but no luck.

We've validated several things: no network problem, no clock skew, same OS
and ceph version everywhere. We've also removed the mon completely, and
recreated it. We also tried to run an additional mon on one of the OSD
machines, this mon didn't join the quorum either.

We've opened https://tracker.ceph.com/issues/23403 with logs from the 3 mons
during a fresh startup of the problematic mon.

Is there anything we could try to do to resolve this issue? We are running
out of ideas.

We'd appreciate any suggestion!

Gauvain Pocentek

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com