Re: Reply: One of three monitors can not be started

张皓宇 <zhanghaoyu1988@xxxxxxxxxxx> · Thu, 2 Apr 2015 13:57:41 +0800

I checked the cluster state; it has recovered to HEALTH_OK. I don't know why.

Yesterday at 09:02 I started mon.computer06, but it would not start; the log is in attachment 0902.

At 16:38 I started mon.computer06 again, and it also got stuck, with these processes running:
/usr/bin/ceph-mon -i computer06 --pid-file /var/run/ceph/mon.computer06.pid -c /etc/ceph/ceph.conf
/usr/sbin/ceph-create-keys -i computer06

But this morning it was simply OK. The log is in attachment 1638. Can anyone explain that?

To: greg@xxxxxxxxxxx
From: zhanghaoyu1988@xxxxxxxxxxx
Subject: Reply: [ceph-users] One of three monitors can not be started
Date: Thu, 2 Apr 2015 07:53:19 +0800

It does not respond.

From: Gregory Farnum
Sent: 2015/4/2 1:01
To: 张皓宇
Subject: Re: [ceph-users] One of three monitors can not be started

On Tue, Mar 31, 2015 at 10:25 PM, 张皓宇 <zhanghaoyu1988@xxxxxxxxxxx> wrote:

> There is an asok on computer06.
> I tried to start mon.computer06; about two hours later it still had
> not started, but there are some different processes on computer06
> and I don't know how to handle them:
> root      7812     1  0 11:39 pts/4    00:00:00 python /usr/sbin/ceph-create-keys -i computer06

That's a thing that runs on every monitor invocation to make sure
necessary keys are in place; it's just stuck because the monitor isn't
working.

> root     11025     1 12 09:02 pts/4    00:32:13 /usr/bin/ceph-mon -i computer06 --pid-file /var/run/ceph/mon.computer06.pid -c /etc/ceph/ceph.conf

That's the monitor.

> root     35692  7812  0 12:59 pts/4    00:00:00 python /usr/bin/ceph --cluster=ceph --admin-daemon=/var/run/ceph/ceph-mon.computer06.asok mon_status

This is an attempt of yours to invoke mon_status on the admin socket.
So you're saying the admin socket is there but it's not responding to
queries?
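
One way to check whether the admin socket answers at all, without leaving another hung client behind, is to run the same query with a timeout. The following is only a minimal Python sketch: it reuses the socket path and the command shown in the process listing above, while the helper names and the 10-second timeout are illustrative assumptions, not part of Ceph.

```python
import subprocess

# Socket path taken from the process listing above; adjust for other hosts.
ASOK = "/var/run/ceph/ceph-mon.computer06.asok"


def mon_status_cmd(asok=ASOK):
    """Build the same admin-socket query that appears in the ps output."""
    return ["ceph", "--admin-daemon=" + asok, "mon_status"]


def query_mon_status(asok=ASOK, timeout=10):
    """Return the mon_status output, or None if the socket does not answer.

    The timeout value is an arbitrary, illustrative choice.
    """
    try:
        result = subprocess.run(
            mon_status_cmd(asok),
            capture_output=True,
            text=True,
            timeout=timeout,
        )
    except (subprocess.TimeoutExpired, FileNotFoundError):
        # Either the monitor did not reply in time or the ceph CLI is missing.
        return None
    return result.stdout if result.returncode == 0 else None
```

If `query_mon_status()` returns None while the .asok file exists, that would be consistent with a monitor that has opened its socket but is too busy to answer.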

>
> I got the quorum_status from another running monitor:
> { "election_epoch": 508,
>   "quorum": [
>         0,
>         1],
>   "quorum_names": [
>         "computer05",
>         "computer04"],
>   "quorum_leader_name": "computer04",
>   "monmap": { "epoch": 4,
>       "fsid": "471483e5-493f-41f6-b6f4-0187c13d156d",
>       "modified": "2014-07-26 09:52:02.411967",
>       "created": "0.000000",
>       "mons": [
>             { "rank": 0,
>               "name": "computer04",
>               "addr": "192.168.1.60:6789\/0"},
>             { "rank": 1,
>               "name": "computer05",
>               "addr": "192.168.1.65:6789\/0"},
>             { "rank": 2,
>               "name": "computer06",
>               "addr": "192.168.1.66:6789\/0"}]}}

And that indicates mon.computer04 and mon.computer05 are working and
in a quorum together to make progress.
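
The same conclusion can be read off mechanically by comparing the monmap against quorum_names. A small Python sketch, using a trimmed copy of the quorum_status output quoted above (the function name is made up for illustration):

```python
import json

# Trimmed copy of the quorum_status output quoted above (raw string so the
# JSON "\/" escapes survive unchanged).
QUORUM_STATUS = r"""
{ "election_epoch": 508,
  "quorum": [0, 1],
  "quorum_names": ["computer05", "computer04"],
  "quorum_leader_name": "computer04",
  "monmap": { "epoch": 4,
      "mons": [
            { "rank": 0, "name": "computer04", "addr": "192.168.1.60:6789\/0"},
            { "rank": 1, "name": "computer05", "addr": "192.168.1.65:6789\/0"},
            { "rank": 2, "name": "computer06", "addr": "192.168.1.66:6789\/0"}]}}
"""


def monitors_out_of_quorum(status):
    """List monitors that are in the monmap but not in the current quorum."""
    in_quorum = set(status["quorum_names"])
    return [m["name"] for m in status["monmap"]["mons"]
            if m["name"] not in in_quorum]


missing = monitors_out_of_quorum(json.loads(QUORUM_STATUS))
print(missing)  # ['computer06']
```

Two of three monitors is still a majority, which is why the cluster keeps working while computer06 is out.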

You said that computer05 got compacted, but that computer06 broke?
Given that computer04 is doing fine, it may not be related. If you
gather a log from mon.computer06 trying to start up (with "debug mon =
20" in the config file to dump a lot of output) somebody may be able
to help you.
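
Once such a log exists, a quick first pass is to pull out the last state the monitor reported about itself. The sketch below is a rough Python helper; its regular expression is inferred only from the "mon.computer06@-1(probing)" line quoted earlier in this thread, so treat the pattern and the function name as assumptions rather than a documented log format.

```python
import re

# Pattern inferred from the quoted log line "mon.computer06@-1(probing)":
# mon.<name>@<rank>(<state>). This is an assumption, not a documented format.
MON_TAG = re.compile(
    r"mon\.(?P<name>[^@\s]+)@(?P<rank>-?\d+)\((?P<state>[^)]+)\)"
)

# The two log lines quoted earlier in this thread.
SAMPLE_LOG = [
    "2015-03-30 20:46:54.152956 7fc5379d07a0 0 ceph version 0.72.2 "
    "(a913ded2ff138aefb8cb84d347d72164099cfd60), process ceph-mon, pid 12309",
    "2015-03-30 20:46:54.759791 7fc5379d07a0 1 mon.computer06@-1(probing) e4 "
    "preinit clean up potentially inconsistent store state",
]


def last_mon_state(lines):
    """Return (name, rank, state) from the last line carrying a monitor tag."""
    found = None
    for line in lines:
        match = MON_TAG.search(line)
        if match:
            found = (match.group("name"),
                     int(match.group("rank")),
                     match.group("state"))
    return found


print(last_mon_state(SAMPLE_LOG))  # ('computer06', -1, 'probing')
```

To scan a real file, pass the open file object instead of SAMPLE_LOG; the log path depends on the local configuration.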

-Greg

>> Date: Tue, 31 Mar 2015 12:30:22 -0700
>> Subject: Re: [ceph-users] One of three monitors can not be started
>> From: greg@xxxxxxxxxxx
>> To: zhanghaoyu1988@xxxxxxxxxxx
>> CC: ceph-users@xxxxxxxxxxxxxx
>>
>> On Tue, Mar 31, 2015 at 2:50 AM, 张皓宇 <zhanghaoyu1988@xxxxxxxxxxx> wrote:

>> > Who can help me?
>> >
>> > One monitor in my ceph cluster can not be started.
>> > Before that, I added '[mon] mon_compact_on_start = true' to
>> > /etc/ceph/ceph.conf on the three monitor hosts. Then I ran 'ceph tell
>> > mon.computer05 compact' on computer05, which has a monitor on it.
>> > When the store.db of computer05 shrank from 108G to 1G, mon.computer06
>> > stopped, and it has not been able to start since then.

>> > If I start mon.computer06, it will stop on this state:
>> > # /etc/init.d/ceph start mon.computer06
>> > === mon.computer06 ===
>> > Starting Ceph mon.computer06 on computer06...
>> >
>> > The process info is like this:
>> > root 12149 3807 0 20:46 pts/27 00:00:00 /bin/sh /etc/init.d/ceph start mon.computer06
>> > root 12308 12149 0 20:46 pts/27 00:00:00 bash -c ulimit -n 32768; /usr/bin/ceph-mon -i computer06 --pid-file /var/run/ceph/mon.computer06.pid -c /etc/ceph/ceph.conf
>> > root 12309 12308 0 20:46 pts/27 00:00:00 /usr/bin/ceph-mon -i computer06 --pid-file /var/run/ceph/mon.computer06.pid -c /etc/ceph/ceph.conf
>> > root 12313 12309 19 20:46 pts/27 00:00:01 /usr/bin/ceph-mon -i computer06 --pid-file /var/run/ceph/mon.computer06.pid -c /etc/ceph/ceph.conf

>> > Log on computer06 is like this:
>> > 2015-03-30 20:46:54.152956 7fc5379d07a0 0 ceph version 0.72.2 (a913ded2ff138aefb8cb84d347d72164099cfd60), process ceph-mon, pid 12309
>> > ...
>> > 2015-03-30 20:46:54.759791 7fc5379d07a0 1 mon.computer06@-1(probing) e4 preinit clean up potentially inconsistent store state

>> So I haven't looked at this code in a while, but I think the monitor
>> is trying to validate that it's consistent with the others. You
>> probably want to dig around the monitor admin sockets and see what
>> state each monitor is in, plus its perception of the others.
>>
>> In this case, I think maybe mon.computer06 is trying to examine its
>> whole store, but 100GB is a lot (way too much, in fact), so this can
>> take a loooong time.

>> > Sorry, my English is not good.
>> >
>> > _______________________________________________
>> > ceph-users mailing list
>> > ceph-users@xxxxxxxxxxxxxx
>> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

Attachment: 0902
Description: Binary data

Attachment: 1638
Description: Binary data