Re: Adding a monitor to

Gregory Farnum <greg@xxxxxxxxxxx> · Wed, 29 Oct 2014 17:00:19 -0700



[Re-adding the list so this is archived for posterity.]

On Wed, Oct 29, 2014 at 6:11 AM, Patrick Darley
<patrick.darley@xxxxxxxxxxxxxxx> wrote:
>
> Thanks again for the reply Greg!
>
> On 2014-10-28 17:39, Gregory Farnum wrote:
>>
>> I'm sorry, you're right — I misread it. :(
>
>
> No worries, I had included some misleading words like "generate" in my rough
> description where "retrieve" would have been more appropriate. Sorry!
>
>> But indeed step 6 is the crucial one, which tells the existing
>> monitors to accept the new one into the cluster. You'll need to run it
>> with an admin client keyring that can connect to the existing cluster;
>> that's probably the part that has gone wrong. You don't need to run it
>> from the new monitor,
>
>
> I think that in order to carry out the 5th step, "preparing the monitor's data
> directory", you also need the client.admin keyring present. I had scp-ed it
> across to the monitor along with the ceph.conf file and put them in the
> expected location, /etc/ceph/, prior to running that command.
>
>> so if you're having trouble getting the keys to
>> behave I'd just run it from an existing system. :)
>
>
> I tried running this command, step 6, from the admin node of my ubuntu ceph
> cluster. As I had experienced before, the command hung. After that, running
> any ceph command on the rest of the cluster gives a long hang and then the
> following error:
>
>     cc@ucc01:~$ ceph -s
>     2014-10-29 10:40:33.748334 7ffaec051700  0 monclient(hunting):
> authenticate timed out after 300
>     2014-10-29 10:40:33.748499 7ffaec051700  0 librados: client.admin
> authentication error (110) Connection timed out
>     Error connecting to cluster: TimedOut
>
>
> The monitor that I was trying to add can be started ok after this (once I
> have touched the done and sysvinit files), but it also gives the above error
> when attempting to run ceph -s. Checking its log file I see the following
> lines repeated:
>
>
>     2014-10-29 10:01:01.721905 7ffd548ac700  0 mon.bcc07@-1(probing) e0
> handle_probe ignoring fsid 5021163c-3c0b-4ec5-83fe-f0622c0e9447 !=
> f2d609ef-2065-4862-a821-55c484d61dca
>     2014-10-29 10:01:01.809991 7ffd550ad700  1
> mon.bcc07@-1(probing).paxos(paxos recovering c 0..0) is_readable
> now=2014-10-29 10:01:01.809996 lease_expire=0.000000 has v0 lc 0
>     2014-10-29 10:01:03.721559 7ffd548ac700  0 mon.bcc07@-1(probing) e0
> handle_probe ignoring fsid 5021163c-3c0b-4ec5-83fe-f0622c0e9447 !=
> f2d609ef-2065-4862-a821-55c484d61dca
>     2014-10-29 10:01:03.810466 7ffd550ad700  1
> mon.bcc07@-1(probing).paxos(paxos recovering c 0..0) is_readable
> now=2014-10-29 10:01:03.810467 lease_expire=0.000000 has v0 lc 0
>
>
> The initial monitor has the following log at around a similar time:
>
>
>     2014-10-29 10:01:02.169655 7f52e7408700  0 mon.ucc01@1(probing) e2
> handle_probe ignoring fsid f2d609ef-2065-4862-a821-55c484d61dca !=
> 5021163c-3c0b-4ec5-83fe-f0622c0e9447
>     2014-10-29 10:01:04.170153 7f52e7408700  0 mon.ucc01@1(probing) e2
> handle_probe ignoring fsid f2d609ef-2065-4862-a821-55c484d61dca !=
> 5021163c-3c0b-4ec5-83fe-f0622c0e9447
>     2014-10-29 10:01:06.169300 7f52e7408700  0 mon.ucc01@1(probing) e2
> handle_probe ignoring fsid f2d609ef-2065-4862-a821-55c484d61dca !=
> 5021163c-3c0b-4ec5-83fe-f0622c0e9447
>
>
> It looks to me like there might be conflicting fsid values being compared
> somewhere, but checking the ceph.conf files on the
> nodes I found them to be declared as the same. The log files recorded a
> similar output on both monitors for some time.
>
> I then turned off the monitor I was attempting to add at approximately
> 12:39:30 and the log file of the initial
> monitor has the following output around this time:
>
>
>     2014-10-29 12:39:30.304639 7f52e7408700  0 mon.ucc01@1(probing) e2
> handle_probe ignoring fsid f2d609ef-2065-4862-a821-55c484d61dca !=
> 5021163c-3c0b-4ec5-83fe-f0622c0e9447

Okay, that's indeed not right. I suspect this is your issue but I'm
not entirely certain because your other symptoms are a bit weird. I
bet Joao can help though; he maintains the monitor and deals with
these issues a lot more often than I do. :)
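
If you want to confirm how consistently that mismatch shows up, a rough
Python sketch like the following could pull the fsid pairs out of a monitor
log. The function name and the regex are only my illustration, not anything
shipped with Ceph, and I'm assuming the first fsid on each line is the one
carried by the incoming probe and the second is the one in the logging
monitor's own monmap.

```python
import re

# Pattern based on the "handle_probe ignoring fsid" lines quoted in this
# thread. Assumption: first fsid = incoming probe, second = local monmap.
PROBE_RE = re.compile(
    r"mon\.(?P<name>[^@\s]+)@-?\d+\(\w+\) e\d+ handle_probe ignoring fsid "
    r"(?P<probe_fsid>[0-9a-f-]{36}) != (?P<local_fsid>[0-9a-f-]{36})"
)


def fsid_mismatches(lines):
    """Return the set of (monitor, probe_fsid, local_fsid) seen in log lines.

    Each element of `lines` must be one complete, unwrapped log line.
    """
    found = set()
    for line in lines:
        match = PROBE_RE.search(line)
        if match:
            found.add(
                (
                    match.group("name"),
                    match.group("probe_fsid"),
                    match.group("local_fsid"),
                )
            )
    return found
```

Feed it the lines of each monitor's log file (for example
`fsid_mismatches(open(path))`, with `path` pointing at wherever your logs
live) and compare the fsids it reports against the `fsid` in your ceph.conf.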
-Greg

>     2014-10-29 12:39:32.023964 7f52e7c09700  0
> mon.ucc01@1(probing).data_health(1) update_stats avail 68% total 14318640
> used 3748076 avail 9820180
>     2014-10-29 12:39:32.303740 7f52e7408700  0 mon.ucc01@1(probing) e2
> handle_probe ignoring fsid f2d609ef-2065-4862-a821-55c484d61dca !=
> 5021163c-3c0b-4ec5-83fe-f0622c0e9447
>     2014-10-29 12:39:32.394606 7f52e53fd700  0 -- 192.168.122.95:6789/0 >>
> 192.168.122.42:6789/0 pipe(0x55e5180 sd=24 :6789 s=2 pgs=1 cs=1 l=0
> c=0x39bfde0).fault with nothing to send, going to standby
>     2014-10-29 12:39:33.862400 7f52e5902700  0 -- 192.168.122.95:6789/0 >>
> 192.168.122.42:6789/0 pipe(0x55e5180 sd=13 :6789 s=1 pgs=1 cs=2 l=0
> c=0x39bfde0).fault
>     2014-10-29 12:40:32.024807 7f52e7c09700  0
> mon.ucc01@1(probing).data_health(1) update_stats avail 68% total 14318640
> used 3748072 avail 9820184
>     2014-10-29 12:41:32.025632 7f52e7c09700  0
> mon.ucc01@1(probing).data_health(1) update_stats avail 68% total 14318640
> used 3748072 avail 9820184
>     2014-10-29 12:42:32.027091 7f52e7c09700  0
> mon.ucc01@1(probing).data_health(1) update_stats avail 68% total 14318640
> used 3748072 avail 9820184
>
>
> It seems to me that the initial monitor is probing for the monitor I have
> tried to add; then, once this monitor is started, there is some conflict
> with fsid values. When the monitor I am trying to add is stopped again, the
> initial monitor goes back to probing.
>
>
> Is this what the problem might be? And why would this have happened?
>
> If not, do you know what the problem might be?
>
> And any ideas about what I might have done wrong or what could be done
> differently?
>
> Thanks again,
>
> Patrick
>
>
>
>> On Tue, Oct 28, 2014 at 10:11 AM, Patrick Darley
>> <patrick.darley@xxxxxxxxxxxxxxx> wrote:
>>>
>>> On 2014-10-28 16:08, Gregory Farnum wrote:
>>>>
>>>>
>>>> On Mon, Oct 27, 2014 at 11:37 AM, Patrick Darley
>>>> <patrick.darley@xxxxxxxxxxxxxxx> wrote:
>>>>>
>>>>>
>>>>> Hi there
>>>>>
>>>>> Over the last week or so, I've been trying to connect a ceph monitor
>>>>> node running on a baserock system to a simple 3-node ubuntu ceph cluster.
>>>>>
>>>>> The 3-node ubuntu cluster was created by following the documented Quick
>>>>> installation guide using 3 VMs running ubuntu Trusty.
>>>>>
>>>>> After the ubuntu cluster has been deployed I would then follow the
>>>>> directions below, which I derived from comparing the ceph-deploy debug
>>>>> information, the ceph documentation on adding monitor nodes to an
>>>>> existing system and the ceph documentation on bootstrapping monitor nodes.
>>>>>
>>>>>  1. scp the /etc/ceph/* from admin node
>>>>>  2. create the dir: mkdir /var/lib/ceph/mon/ceph-bcc08
>>>>>  3. generate mon keyring: sudo ceph auth get mon. -o
>>>>> /var/lib/ceph/tmp/ceph-bcc08.mon.keyring
>>>>>  4. generate monmap: sudo ceph mon getmap -o /var/lib/ceph/tmp/monmap
>>>
>>>
>>>
>>>> Yeah, this is wrong. Here you're giving the monitor its own keyring, and
>>>> it will expect anybody who talks to it to be encrypting with it.
>>>
>>>
>>>
>>> If you are referring to steps 3 and 4 above, I believe these are
>>> synonymous with steps 3 and 4 of the documentation you recommended. The
>>> monitor keyring and the current monmap are retrieved from the initial
>>> monitor. They are then used in step 5 to prepare the monitor's data
>>> directory.
>>>
>>>
>>>> The docs have a section on adding monitors which should work verbatim;
>>>> if not it's a doc bug:
>>>>
>>>>
>>>>
>>>>
>>>> http://ceph.com/docs/master/rados/operations/add-or-rm-mons/#adding-monitors
>>>
>>>
>>>
>>> Thanks for the recommendation. I have tried to use this procedure a
>>> couple of times but got stuck at step number 6 of this method. The
>>> command given hangs then times out, causing the rest of the cluster to
>>> fail.
>>>
>>>> -Greg
>>>
>>>
>>>
>>>
>>> Thanks for the reply!
>>>
>>> Much appreciated,
>>>
>>> Patrick
>>>
>>>
>>>
>>>
>>>
>>>
>>>>>  5. That filesystem thingy: sudo ceph-mon --cluster ceph --mkfs -i
>>>>> bcc08
>>>>> --keyring /var/lib/ceph/tmp/ceph-bcc08.mon.keyring --monmap
>>>>> /var/lib/ceph/tmp/monmap
>>>>>  6. Unlink keys and old monmap: rm /var/lib/ceph/tmp/*
>>>>>  7. touch things: touch /var/lib/ceph/mon/ceph-bcc08/done and touch
>>>>> /var/lib/ceph/mon/ceph-bcc08/sysvinit
>>>>>  8. Then start the mon: sudo /etc/init.d/ceph start mon.bcc08
>>>>>
>>>>> When I carry out these steps in the attempt to add a baserock system to
>>>>> the ubuntu cluster, the monitor node has not been added to the cluster
>>>>> and the admin socket mon_status gives the following output.
>>>>>
>>>>>   ~ # ceph --cluster=ceph --admin-daemon /var/run/ceph/ceph-mon.bcc07.asok mon_status
>>>>>   { "name": "bcc07",
>>>>>     "rank": -1,
>>>>>     "state": "probing",
>>>>>     "election_epoch": 0,
>>>>>     "quorum": [],
>>>>>     "outside_quorum": [],
>>>>>     "extra_probe_peers": [],
>>>>>     "sync_provider": [],
>>>>>     "monmap": { "epoch": 0,
>>>>>         "fsid": "4460079d-42f4-4e3a-8ce3-e2a7fa2685e6",
>>>>>         "modified": "2014-10-27 12:37:25.531542",
>>>>>         "created": "2014-10-27 12:37:25.531542",
>>>>>         "mons": [
>>>>>               { "rank": 0,
>>>>>                 "name": "ucc01",
>>>>>                 "addr": "192.168.122.95:6789\/0"}]}}
>>>>>
>>>>>
>>>>> And the newly added monitor remains stuck in the probing state
>>>>> indefinitely. To try and resolve this issue I have worked through the
>>>>> monitor troubleshooting page of the ceph documentation, e.g. ntp
>>>>> synchronisation and checking network connectivity (to the best of my
>>>>> ability :-s ).
>>>>>
>>>>> It is also worth mentioning that I have created a 3 node ceph cluster
>>>>> on baserock machines (1 mon, 2 osds) then successfully added monitor
>>>>> nodes running baserock and ubuntu systems using the same 8 step process
>>>>> given above.
>>>>>
>>>>> This leaves me confused as to why adding the monitor run on baserock to
>>>>> the all ubuntu cluster specifically is causing problems.
>>>>>
>>>>> Are there any reasons why this 'probing' problem could be occurring? I'm
>>>>> feeling a little stuck on how to proceed and would welcome any
>>>>> suggestions.
>>>>>
>>>>> Thanks for your help,
>>>>>
>>>>> Patrick
>>>>> _______________________________________________
>>>>> ceph-users mailing list
>>>>> ceph-users@xxxxxxxxxxxxxx
>>>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>