On Mon, Jul 25, 2016 at 9:37 PM, Sergio A. de Carvalho Jr.
<scarvalhojr@xxxxxxxxx> wrote:
Awesome, thanks so much, Joao.
Here's the mon_status:
https://gist.github.com/anonymous/2b80a9a75d134d9e539dfbc81615c055
I'm still trying to collect the logs, but while doing that I noticed
that the log records are severely delayed compared to the system
clock. For example, watching the logs with tail -f, I see records
with a timestamp that is up to 28 minutes behind the system clock!
Also, while trying to set debug level, the monitors sometimes hung
for several minutes, so there's obviously something wrong with them.
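(For reference, a minimal way to eyeball that delay - this assumes the default log location and uses one of this cluster's mon IDs as an example:)

  date                                          # current system clock
  tail -n 1 /var/log/ceph/ceph-mon.610wl02.log  # timestamp of the newest log record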
On Mon, Jul 25, 2016 at 6:16 PM, Joao Eduardo Luis <joao@xxxxxxx> wrote:
On 07/25/2016 05:55 PM, Sergio A. de Carvalho Jr. wrote:
I just forced an NTP update on all hosts to rule out clock skew. I also checked that hosts can reach all other hosts on port 6789.
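(A minimal sketch of those checks - ntpdate and nc are assumed to be installed, and the NTP server name is a placeholder:)

  ntpdate -u pool.ntp.org          # force a one-off sync against a placeholder server
  nc -z -w 2 10.98.2.166 6789      # verify a peer monitor's port 6789 is reachable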
I then stopped monitor 0 (60z0m02) and started monitor 1 (60zxl02), but the 3 monitors left (1 - 60zxl02, 2 - 610wl02, 4 - 615yl02) were still having problems reaching quorum. That led me to believe monitor 4 was the problem, because I had a quorum before with monitors 0, 1, 2.

So I stopped monitor 4 and started monitor 0 again, but this time monitors 0, 1, 2 failed to reach a quorum, which is rather puzzling.

All hosts are pretty much idle all the time, so I can't see why monitors would be getting stuck.
Grab 'ceph daemon mon.<ID> mon_status' from all monitors, set 'debug mon = 10', 'debug paxos = 10', 'debug ms = 1', and grab logs from the monitors for the periods before, during and after an election (ideally for at least two election cycles). Put them up somewhere on the interwebs and send us a link.
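(A minimal sketch of doing that through each monitor's admin socket, which works even without quorum; the mon ID below is just one of this cluster's monitors, repeat on every mon host:)

  ceph daemon mon.60z0m02 config set debug_mon 10
  ceph daemon mon.60z0m02 config set debug_paxos 10
  ceph daemon mon.60z0m02 config set debug_ms 1
  ceph daemon mon.60z0m02 mon_status > mon_status.60z0m02.json   # status dump for this monitor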
If you don't feel comfortable putting that link in the list,
send me an email directly with the url.
I'll be happy to take a look later tonight.
-Joao
On Mon, Jul 25, 2016 at 5:18 PM, Joao Eduardo Luis <joao@xxxxxxx> wrote:
On 07/25/2016 04:34 PM, Sergio A. de Carvalho Jr. wrote:
Thanks, Joao.
All monitors have the exact same monmap.
I suspect you're right that there might be some communication problem though. I stopped monitor 1 (60zxl02), but the other 3 monitors still failed to reach a quorum. I could see monitor 0 was still declaring victory but the others were always calling for new elections:
2016-07-25 15:18:59.775144 7f8760af7700 0 log_channel(cluster) log [INF] : mon.60z0m02@0 won leader election with quorum 0,2,4

2016-07-25 15:18:54.702176 7fc1b357d700 1 mon.610wl02@2(electing) e5 handle_timecheck drop unexpected msg
2016-07-25 15:18:54.704526 7fc1b357d700 1 mon.610wl02@2(electing).data_health(11626) service_dispatch not in quorum -- drop message
2016-07-25 15:19:09.792511 7fc1b3f7e700 1 mon.610wl02@2(peon).paxos(paxos recovering c 1318755..1319322) lease_timeout -- calling new election
2016-07-25 15:19:09.792825 7fc1b357d700 0 log_channel(cluster) log [INF] : mon.610wl02 calling new monitor election

I'm curious about the "handle_timecheck drop unexpected msg" message.
timechecks (i.e., checking for clock skew), as well as the data_health service (which makes sure you have enough disk space in the mon data dir), are only run when you have a quorum. If a message is received by a monitor not in a quorum, regardless of state, it will be dropped.
Assuming you have now taken one of the self-appointed leaders out - by shutting it down, for instance - you should check what's causing elections not to hold.
In these cases, assuming your 3 monitors do form a quorum, the traditional issue tends to be 'lease timeouts'. I.e., the leader fails to provide a lease extension on paxos for the peons, and the peons assume the leader failed in some form (unresponsive, down, whatever).
Above, it does seem a lease timeout was triggered on a peon. This may have happened because:

1. the leader did not extend the lease;
2. the leader did extend the lease, but the lease was in the past - usually an indication of clock skew on the leader, on the peons, or both;
3. the leader did extend the lease with the correct time, but the peon failed to dispatch the message on time.
Both 1. and 2. may be due to several factors, but most commonly it's because the monitor was stuck doing something. This something, more often than not, is leveldb. If this is the case, check the size of your leveldb. If it is over 5 or 6 GB in size, you may need to manually compact the store ('mon compact on start = true', iirc).
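(A minimal sketch of that check, assuming the default mon data path and one of this cluster's mon IDs:)

  du -sh /var/lib/ceph/mon/ceph-60z0m02/store.db   # size of the monitor's leveldb store
  # if it is several GB, add 'mon compact on start = true' under [mon] in
  # ceph.conf and restart that monitor so the store gets compacted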
HTH
-Joao
On Mon, Jul 25, 2016 at 4:10 PM, Joao Eduardo Luis <joao@xxxxxxx> wrote:
On 07/25/2016 03:41 PM, Sergio A. de Carvalho Jr. wrote:
In the logs, 2 monitors are constantly reporting that they won the leader election:
60z0m02 (monitor 0):

2016-07-25 14:31:11.644335 7f8760af7700 0 log_channel(cluster) log [INF] : mon.60z0m02@0 won leader election with quorum 0,2,4
2016-07-25 14:31:44.521552 7f8760af7700 1 mon.60z0m02@0(leader).paxos(paxos recovering c 1318755..1319320) collect timeout, calling fresh election

60zxl02 (monitor 1):

2016-07-25 14:31:59.542346 7fefdeaed700 1 mon.60zxl02@1(electing).elector(11441) init, last seen epoch 11441
2016-07-25 14:32:04.583929 7fefdf4ee700 0 log_channel(cluster) log [INF] : mon.60zxl02@1 won leader election with quorum 1,2,4
2016-07-25 14:32:33.440103 7fefdf4ee700 1 mon.60zxl02@1(leader).paxos(paxos recovering c 1318755..1319319) collect timeout, calling fresh election
There are two likely scenarios to explain this:

1. The monitors have different monitors in their monmaps - this could happen if you didn't add the new monitor via 'ceph mon add'. You can check this by running 'ceph daemon mon.<ID> mon_status' for each of the monitors in the cluster (see the sketch after this list).

2. Some of the monitors are unable to communicate with each other, thus will never acknowledge the same leader. This does not mean you have two leaders for the same cluster, but it does mean that you will end up having two monitors declaring victory and becoming the self-proclaimed leader in the cluster. The peons should still only belong to one quorum.
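(A minimal sketch of that comparison - it assumes jq is installed and that each mon ID matches the host's short hostname, as in this cluster:)

  # run on every monitor host; epoch, fsid and the mons list should be identical
  ceph daemon mon.$(hostname -s) mon_status | jq '.monmap'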
If this does not help you, try setting 'debug mon = 10' and 'debug ms = 1' on the monitors and check the logs, making sure the monitors get the probes and follow the election process. If you need further assistance, put those logs online somewhere we can access them and we'll try to help you out.
-Joao
On Mon, Jul 25, 2016 at 3:27 PM, Sergio A. de Carvalho Jr. <scarvalhojr@xxxxxxxxx> wrote:
Hi,
I have a cluster of 5 hosts running Ceph 0.94.6 on CentOS 6.5. On each host, there is 1 monitor and 13 OSDs. We had an issue with the network and for some reason (I still don't know why) the servers were restarted. One host is still down, but the monitors on the 4 remaining servers are failing to enter a quorum.
I managed to get a quorum of 3 monitors by stopping all Ceph monitors and OSDs across all machines, and bringing up the top 3 ranked monitors in order of rank. After a few minutes, the 60z0m02 monitor (the top ranked one) became the leader:
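(Roughly, the sequence used - this assumes the standard sysvinit scripts that ship with hammer on CentOS 6:)

  service ceph stop                 # on every host: stop all local ceph daemons
  service ceph start mon.60z0m02    # then start the top 3 ranked monitors,
  service ceph start mon.60zxl02    # one at a time, each on its own host
  service ceph start mon.610wl02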
{
    "name": "60z0m02",
    "rank": 0,
    "state": "leader",
    "election_epoch": 11328,
    "quorum": [
        0,
        1,
        2
    ],
    "outside_quorum": [],
    "extra_probe_peers": [],
    "sync_provider": [],
    "monmap": {
        "epoch": 5,
        "fsid": "2f51a247-3155-4bcf-9aee-c6f6b2c5e2af",
        "modified": "2016-04-28 22:26:48.604393",
        "created": "0.000000",
        "mons": [
            {
                "rank": 0,
                "name": "60z0m02",
                "addr": "10.98.2.166:6789\/0"
            },
            {
                "rank": 1,
                "name": "60zxl02",
                "addr": "10.98.2.167:6789\/0"
            },
            {
                "rank": 2,
                "name": "610wl02",
                "addr": "10.98.2.173:6789\/0"
            },
            {
                "rank": 3,
                "name": "618yl02",
                "addr": "10.98.2.214:6789\/0"
            },
            {
                "rank": 4,
                "name": "615yl02",
                "addr": "10.98.2.216:6789\/0"
            }
        ]
    }
}
The other 2 monitors became peons:
"name": "60zxl02",
"rank": 1,
"state": "peon",
"election_epoch": 11328,
"quorum": [
0,
1,
2
],
"name": "610wl02",
"rank": 2,
"state": "peon",
"election_epoch": 11328,
"quorum": [
0,
1,
2
],
I then proceeded to start the fourth monitor, 615yl02 (618yl02 is powered off), but after more than 2 hours and several election rounds, the monitors still haven't reached a quorum. The monitors mostly alternate between the "electing" and "probing" states, but they often seem to be in different election epochs. Is this normal?

Is there anything I can do to help the monitors elect a leader? Should I manually remove the dead host's monitor from the monitor map?
I left all OSD daemons stopped on purpose while the election is going on. Is this the best thing to do? Would bringing the OSDs up help or complicate matters even more? Or does it not make any difference?
I don't see anything obviously wrong in the monitor logs. They're mostly filled with messages like the following:

2016-07-25 14:17:57.806148 7fc1b3f7e700 1 mon.610wl02@2(electing).elector(11411) init, last seen epoch 11411
2016-07-25 14:17:57.829198 7fc1b7caf700 0 log_channel(audit) log [DBG] : from='admin socket' entity='admin socket' cmd='mon_status' args=[]: dispatch
2016-07-25 14:17:57.829200 7fc1b7caf700 0 log_channel(audit) do_log log to syslog
2016-07-25 14:17:57.829254 7fc1b7caf700 0 log_channel(audit) log [DBG] : from='admin socket' entity='admin socket' cmd=mon_status args=[]: finished
Any help would be hugely appreciated.
Thanks,
Sergio
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com