Re: 1 mon unable to join the quorum

See my latest update in the tracker.

On Sun, Apr 1, 2018 at 2:27 AM, Julien Lavesque
<julien.lavesque@xxxxxxxxxxxxxxxxxx> wrote:
> The cluster was originally deployed with ceph-ansible on the Infernalis
> release.
> For some unknown reason controller02 dropped out of the quorum and we were
> unable to add it back.
>
> We upgraded the cluster to Jewel using the rolling-update
> playbook from ceph-ansible.
>
> controller02 was still not in the quorum.
>
> We then tried deleting the mon completely and adding it again using the
> manual method from
> http://docs.ceph.com/docs/jewel/rados/operations/add-or-rm-mons/
> (with id controller02).
>
> The logs provided are from when controller02 was added with the manual
> method.
>
> But controller02 still won't join the cluster.
>
> Hope this helps clarify things.
>
>
>
> On 31/03/2018 02:12, Brad Hubbard wrote:
>>
>> I'm not sure I completely understand your "test". What exactly are you
>> trying to achieve and what documentation are you following?
>>
>> On Fri, Mar 30, 2018 at 10:49 PM, Julien Lavesque
>> <julien.lavesque@xxxxxxxxxxxxxxxxxx> wrote:
>>>
>>> Brad,
>>>
>>> Thanks for your answer
>>>
>>> On 30/03/2018 02:09, Brad Hubbard wrote:
>>>>
>>>>
>>>> 2018-03-19 11:03:50.819493 7f842ed47640  0 mon.controller02 does not exist in monmap, will attempt to join an existing cluster
>>>> 2018-03-19 11:03:50.820323 7f842ed47640  0 starting mon.controller02 rank -1 at 172.18.8.6:6789/0 mon_data /var/lib/ceph/mon/ceph-controller02 fsid f37f31b1-92c5-47c8-9834-1757a677d020
>>>>
>>>> We are called 'mon.controller02' and we can not find our name in the
>>>> local copy of the monmap.
>>>>
>>>> 2018-03-19 11:03:52.346318 7f842735d700 10 mon.controller02@-1(probing) e68  ready to join, but i'm not in the monmap or my addr is blank, trying to join
>>>>
>>>> Our name is not in the copy of the monmap we got from peer controller01
>>>> either.
>>>
>>>
>>>
>>> During our test we completely deleted the controller02 monitor and added
>>> it again.
>>>
>>> The log you have is from when controller02 was added (so it wasn't in the
>>> monmap before).
>>>
>>>
>>>>
>>>> $ cat ../controller02-mon_status.log
>>>> [root@controller02 ~]# ceph --admin-daemon /var/run/ceph/ceph-mon.controller02.asok mon_status
>>>> {
>>>>     "name": "controller02",
>>>>     "rank": 1,
>>>>     "state": "electing",
>>>>     "election_epoch": 32749,
>>>>     "quorum": [],
>>>>     "outside_quorum": [],
>>>>     "extra_probe_peers": [],
>>>>     "sync_provider": [],
>>>>     "monmap": {
>>>>         "epoch": 71,
>>>>         "fsid": "f37f31b1-92c5-47c8-9834-1757a677d020",
>>>>         "modified": "2018-03-29 10:48:06.371157",
>>>>         "created": "0.000000",
>>>>         "mons": [
>>>>             {
>>>>                 "rank": 0,
>>>>                 "name": "controller01",
>>>>                 "addr": "172.18.8.5:6789\/0"
>>>>             },
>>>>             {
>>>>                 "rank": 1,
>>>>                 "name": "controller02",
>>>>                 "addr": "172.18.8.6:6789\/0"
>>>>             },
>>>>             {
>>>>                 "rank": 2,
>>>>                 "name": "controller03",
>>>>                 "addr": "172.18.8.7:6789\/0"
>>>>             }
>>>>         ]
>>>>     }
>>>> }
>>>>
>>>> In the monmaps we are called 'controller02', not 'mon.controller02'.
>>>> These names need to be identical.
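
The name check discussed above can be sketched in Python. This is a minimal sketch, assuming the mon_status JSON quoted above has been captured as text; the helper name find_self_in_monmap is hypothetical, not part of any Ceph tooling. It looks up the daemon's own "name" in the monmap it reports. Note that this JSON uses bare names such as "controller02", while the log lines print the same daemon as "mon.controller02".

```python
import json


def find_self_in_monmap(status):
    """Return the monmap entry whose name equals the daemon's own name, or None.

    `status` is the parsed JSON from `ceph --admin-daemon <asok> mon_status`.
    (Hypothetical helper for illustration only.)
    """
    own_name = status["name"]
    for mon in status["monmap"]["mons"]:
        if mon["name"] == own_name:
            return mon
    return None


# Trimmed copy of the mon_status output quoted in this thread.
# Raw string so the JSON escape "\/" reaches the JSON parser unchanged.
SAMPLE = r'''
{
    "name": "controller02",
    "rank": 1,
    "state": "electing",
    "quorum": [],
    "monmap": {
        "epoch": 71,
        "mons": [
            {"rank": 0, "name": "controller01", "addr": "172.18.8.5:6789\/0"},
            {"rank": 1, "name": "controller02", "addr": "172.18.8.6:6789\/0"},
            {"rank": 2, "name": "controller03", "addr": "172.18.8.7:6789\/0"}
        ]
    }
}
'''

status = json.loads(SAMPLE)
entry = find_self_in_monmap(status)

if entry is None:
    print(f"{status['name']} is NOT listed in monmap epoch {status['monmap']['epoch']}")
else:
    print(f"{status['name']} is listed at rank {entry['rank']}, addr {entry['addr']}")

# An empty quorum list means this daemon does not currently see a quorum.
print("in quorum view:", bool(status["quorum"]))
```

The same function could be run against the mon_status output of each of the three monitors to compare what every daemon believes about the monmap.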
>>>>
>>>
>>> The cluster was deployed using ceph-ansible with the servers'
>>> hostnames.
>>> All monitors are called mon.controller0x in the monmap and all 3
>>> monitors have the same configuration.
>>>
>>> We see the same behavior when creating a monmap from scratch:
>>>
>>> [root@controller03 ~]# monmaptool --create --add controller01 172.18.8.5:6789 \
>>>     --add controller02 172.18.8.6:6789 --add controller03 172.18.8.7:6789 \
>>>     --fsid f37f31b1-92c5-47c8-9834-1757a677d020 --clobber test-monmap
>>> monmaptool: monmap file test-monmap
>>> monmaptool: set fsid to f37f31b1-92c5-47c8-9834-1757a677d020
>>> monmaptool: writing epoch 0 to test-monmap (3 monitors)
>>>
>>> [root@controller03 ~]# monmaptool --print test-monmap
>>> monmaptool: monmap file test-monmap
>>> epoch 0
>>> fsid f37f31b1-92c5-47c8-9834-1757a677d020
>>> last_changed 2018-03-30 14:42:18.809719
>>> created 2018-03-30 14:42:18.809719
>>> 0: 172.18.8.5:6789/0 mon.controller01
>>> 1: 172.18.8.6:6789/0 mon.controller02
>>> 2: 172.18.8.7:6789/0 mon.controller03
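
In the output above, the monitors were added as "controller01" and so on, yet the listing prints them as "mon.controller01". A minimal sketch for extracting the bare names from such a listing follows. It assumes the "rank: addr mon.name" line layout exactly as shown in this thread (other releases may print it differently), and parse_monmap_print is a hypothetical helper name.

```python
import re

# Matches lines like "1: 172.18.8.6:6789/0 mon.controller02".
# The layout is taken from the output quoted in this thread and is an
# assumption for any other Ceph release.
MON_LINE = re.compile(r"^(\d+): (\S+) mon\.(\S+)$")


def parse_monmap_print(text):
    """Map bare monitor name -> (rank, addr) from `monmaptool --print` text."""
    mons = {}
    for line in text.splitlines():
        match = MON_LINE.match(line.strip())
        if match:
            rank, addr, name = match.groups()
            mons[name] = (int(rank), addr)
    return mons


# Copy of the listing quoted above.
SAMPLE = """\
monmaptool: monmap file test-monmap
epoch 0
fsid f37f31b1-92c5-47c8-9834-1757a677d020
last_changed 2018-03-30 14:42:18.809719
created 2018-03-30 14:42:18.809719
0: 172.18.8.5:6789/0 mon.controller01
1: 172.18.8.6:6789/0 mon.controller02
2: 172.18.8.7:6789/0 mon.controller03
"""

mons = parse_monmap_print(SAMPLE)
for name, (rank, addr) in sorted(mons.items(), key=lambda item: item[1][0]):
    print(f"rank {rank}: {name} at {addr}")
```

Comparing these bare names with the "name" fields in mon_status is one way to see whether the two views actually differ.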
>>>
>>>
>>>>
>>>> On Thu, Mar 29, 2018 at 7:23 PM, Julien Lavesque
>>>> <julien.lavesque@xxxxxxxxxxxxxxxxxx> wrote:
>>>>>
>>>>>
>>>>> Hi Brad,
>>>>>
>>>>> The results have been uploaded on the tracker
>>>>> (https://tracker.ceph.com/issues/23403)
>>>>>
>>>>> Julien
>>>>>
>>>>>
>>>>> On 29/03/2018 07:54, Brad Hubbard wrote:
>>>>>>
>>>>>>
>>>>>>
>>>>>> Can you update with the result of the following commands from all of
>>>>>> the MONs?
>>>>>>
>>>>>> # ceph --admin-daemon /var/run/ceph/ceph-mon.[whatever].asok mon_status
>>>>>> # ceph --admin-daemon /var/run/ceph/ceph-mon.[whatever].asok quorum_status
>>>>>>
>>>>>> On Thu, Mar 29, 2018 at 3:11 PM, Gauvain Pocentek
>>>>>> <gauvain.pocentek@xxxxxxxxxxxxxxxxxx> wrote:
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> Hello Ceph users,
>>>>>>>
>>>>>>> We are having a problem on a Ceph cluster running Jewel: one of the
>>>>>>> mons left the quorum, and we have not been able to make it join again.
>>>>>>> The two other monitors are running just fine, but obviously we need
>>>>>>> this third one.
>>>>>>>
>>>>>>> The problem first appeared before Jewel, when the cluster was running
>>>>>>> Infernalis. We upgraded hoping that it would solve the problem, but no
>>>>>>> luck.
>>>>>>>
>>>>>>> We've validated several things: no network problem, no clock skew, same
>>>>>>> OS and Ceph version everywhere. We've also removed the mon completely
>>>>>>> and recreated it. We also tried to run an additional mon on one of the
>>>>>>> OSD machines; this mon didn't join the quorum either.
>>>>>>>
>>>>>>> We've opened https://tracker.ceph.com/issues/23403 with logs from the
>>>>>>> 3 mons during a fresh startup of the problematic mon.
>>>>>>>
>>>>>>> Is there anything we could try to do to resolve this issue? We are
>>>>>>> running out of ideas.
>>>>>>>
>>>>>>> We'd appreciate any suggestions!
>>>>>>>
>>>>>>> Gauvain Pocentek
>>>>>>>
>>>>>>> _______________________________________________
>>>>>>> ceph-users mailing list
>>>>>>> ceph-users@xxxxxxxxxxxxxx
>>>>>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



-- 
Cheers,
Brad