Got it.
Are you sending logs to the central syslog servers via TCP (@@<host>) or UDP (@<host>)?
I just realised that my test cluster sends logs via UDP to our usual central syslog server (as our production hosts normally do), but it is also configured to send logs via TCP to a testing Logstash VM. My suspicion at the moment is that this VM isn't handling the volume of logs and could be blocking rsyslog.
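If the TCP path to the Logstash VM is what's wedging rsyslog, one thing worth trying is giving that forwarding action its own queue so a stalled receiver can't block everything else. A rough sketch of the relevant rsyslog.conf snippet (the host name is a placeholder, the directives are the standard action-queue ones; untested on my side):

$ActionQueueType LinkedList
$ActionQueueFileName fwd_logstash
$ActionQueueMaxDiskSpace 1g
$ActionResumeRetryCount -1
*.* @@logstash-test.example.com:514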
On Wed, Jul 27, 2016 at 12:04 PM, Sean Crosby <richardnixonshead@xxxxxxxxx> wrote:
Oh, my problems weren't on Ceph nodes. I've seen this problem on non-Ceph nodes. The symptoms you had of unexplained weirdness with services (in your case, Ceph), and syslog lagging 10 mins behind, just reminded me of symptoms I've seen before where the sending of syslog messages to a central syslog server got stuck and caused unusual problems on the host.

Cheers,
Sean

On 27 July 2016 at 20:57, Sergio A. de Carvalho Jr. <scarvalhojr@xxxxxxxxx> wrote:

In my case, everything else running on the host seems to be okay. I'm wondering if the other problems you see aren't a side-effect of Ceph services running slow?

What do you do to get around the problem when it happens? Disable syslog in Ceph?

What version of Ceph and OS are you using?

On Wed, Jul 27, 2016 at 12:47 AM, Sean Crosby <scrosby@xxxxxxxxxxxxxx> wrote:

Agreed. When I first had these problems, random stuff would just not work. SSH would take a while to log in, the DNS server would process requests slowly, our batch system would freeze and not run jobs. It's now one of the first things I check when services are running weirdly.

My failsafe check is to do

# logger "sean test"

and see if it appears in syslog. If it doesn't show up immediately, I have a problem.

Cheers,
Sean

On 27 July 2016 at 04:01, Sergio A. de Carvalho Jr. <scarvalhojr@xxxxxxxxx> wrote:

The funny thing is that I just restarted the rsyslog daemon on the Ceph hosts and I can now re-enable syslog for Ceph without any issues. It just looks like the rsyslog service had a hiccup, possibly related to a problem on one of the central syslog servers, and this in turn prevented the monitors from operating normally.

It's just scary to think that your logging daemon can cause so much damage!

On Tue, Jul 26, 2016 at 6:48 PM, Joao Eduardo Luis <joao@xxxxxxx> wrote:

On 07/26/2016 06:27 PM, Sergio A. de Carvalho Jr. wrote:
(Just realised I originally replied to Sean directly, so reposting here
for posterity).
Bingo!
wow. This didn't even cross my mind. D:
Thanks for sharing.
I turned off syslog and the monitors quickly reached quorum and
everything seems back to normal. Thanks so much, Sean.
Luckily this is a test cluster. I wonder how I could catch this in a
production cluster before our support engineers spend a day trying to
track the problem down.
The only way I can see to deal with this sort of thing would be to log to syslog on a separate thread and have that thread monitor the latency when writing to syslog.
I don't think there's currently any support for that. I'll try to get something concocted this week, mostly for the fun of it.
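In the meantime, something crude run from cron on each host could at least flag a wedged syslog before it takes the mons with it. A rough sketch, assuming local syslog lands in /var/log/messages and a 5 second threshold is acceptable:

TAG="syslog-latency-check-$(date +%s)"
logger "$TAG"
sleep 5
grep -q "$TAG" /var/log/messages || echo "WARNING: $(hostname) syslog did not record test message within 5s"

That's essentially Sean's logger test made scriptable; wire the echo into whatever alerting you already have.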
-Joao
Any ideas?
On Tue, Jul 26, 2016 at 12:28 PM, Sean Crosby
<richardnixonshead@xxxxxxxxx> wrote:
Hi Sergio,
You don't happen to have rsyslog forwarding to a central log server
by any chance? I've seen this behaviour before when my central log
server is not keeping up with messages.
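A quick way to check is to grep for uncommented forwarding actions (single @ is UDP, @@ is TCP), assuming the config lives in the usual places:

# grep -E '^[^#]*@' /etc/rsyslog.conf /etc/rsyslog.d/*.conf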
Cheers,
Sean
On 26 July 2016 at 21:13, Sergio A. de Carvalho Jr.
<scarvalhojr@xxxxxxxxx> wrote:
I left the 4 nodes running overnight and they just slowed to a crawl... to the point that nothing has been written to the logs in the last 11 hours. So I stopped all monitors this morning and started them again one by one, but they are still extremely slow. Here are their logs:
https://gist.github.com/anonymous/85213467f701c5a69c7fdb4e54bc7406
https://gist.github.com/anonymous/f30a8903e701423825fd4d5aaa651e6a
https://gist.github.com/anonymous/42a1856cc819de5b110d9f887e9859d2
https://gist.github.com/anonymous/652bc41197e83a9d76cf5b2e6a211aa2
I'm still puzzled to see logs being written with a timestamp
that is several minutes behind the system clock. As time passes,
the gap widens and quickly the logs are over 10 minutes behind
the actual time, which explains why the logs above don't seem
to overlap.
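For what it's worth, the lag is easy to eyeball by printing the system clock next to the newest line of a mon log (path below assumes the default log location):

# date; tail -n 1 /var/log/ceph/ceph-mon.*.log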
On Mon, Jul 25, 2016 at 9:37 PM, Sergio A. de Carvalho Jr.
<scarvalhojr@xxxxxxxxx> wrote:
Awesome, thanks so much, Joao.
Here's the mon_status:
https://gist.github.com/anonymous/2b80a9a75d134d9e539dfbc81615c055
I'm still trying to collect the logs, but while doing that I
noticed that the log records are severely delayed compared
to the system clock. For example, watching the logs with
tail -f, I see records with a timestamp that is up to 28
minutes behind the system clock!
Also, while trying to set debug level, the monitors
sometimes hung for several minutes, so there's obviously
something wrong with them.
On Mon, Jul 25, 2016 at 6:16 PM, Joao Eduardo Luis <joao@xxxxxxx> wrote:
On 07/25/2016 05:55 PM, Sergio A. de Carvalho Jr. wrote:
I just forced an NTP update on all hosts to make sure it's not down to clock skew. I also checked that hosts can reach all other hosts on port 6789.

I then stopped monitor 0 (60z0m02) and started monitor 1 (60zxl02), but the 3 monitors left (1 - 60zxl02, 2 - 610wl02, 4 - 615yl02) were still having problems reaching quorum. That led me to believe monitor 4 was the problem because I had a quorum before with monitors 0, 1, 2.

So I stopped monitor 4 and started monitor 0 again, but this time monitors 0, 1, 2 failed to reach a quorum, which is rather puzzling.

All hosts are pretty much idle all the time so I can't see why monitors would be getting stuck.
Grab 'ceph daemon mon.<ID> mon_status' from all monitors, set 'debug mon = 10', 'debug paxos = 10', 'debug ms = 1', and grab logs from the monitors for the periods before, during and after the election (ideally for at least two election cycles). Put them up somewhere on the interwebs and send us a link.
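If you'd rather not restart the mons just to pick up ceph.conf changes, the debug levels can be bumped at runtime through the admin socket on each monitor host, e.g. (substitute each mon's ID):

# ceph daemon mon.<ID> mon_status > mon_status.<ID>.json
# ceph daemon mon.<ID> config set debug_mon 10
# ceph daemon mon.<ID> config set debug_paxos 10
# ceph daemon mon.<ID> config set debug_ms 1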
If you don't feel comfortable putting that link in the list, send me an email directly with the url.
I'll be happy to take a look later tonight.
-Joao
On Mon, Jul 25, 2016 at 5:18 PM, Joao Eduardo Luis <joao@xxxxxxx> wrote:
On 07/25/2016 04:34 PM, Sergio A. de Carvalho Jr. wrote:
Thanks, Joao.
All monitors have the exact same monmap.
I suspect you're right that there might be some communication problem though. I stopped monitor 1 (60zxl02), but the other 3 monitors still failed to reach a quorum. I could see monitor 0 was still declaring victory but the others were always calling for new elections:
2016-07-25 15:18:59.775144 7f8760af7700 0 log_channel(cluster) log [INF] : mon.60z0m02@0 won leader election with quorum 0,2,4

2016-07-25 15:18:54.702176 7fc1b357d700 1 mon.610wl02@2(electing) e5 handle_timecheck drop unexpected msg
2016-07-25 15:18:54.704526 7fc1b357d700 1 mon.610wl02@2(electing).data_health(11626) service_dispatch not in quorum -- drop message
2016-07-25 15:19:09.792511 7fc1b3f7e700 1 mon.610wl02@2(peon).paxos(paxos recovering c 1318755..1319322) lease_timeout -- calling new election
2016-07-25 15:19:09.792825 7fc1b357d700 0 log_channel(cluster) log [INF] : mon.610wl02 calling new monitor election

I'm curious about the "handle_timecheck drop unexpected msg" message.
Timechecks (i.e., checking for clock skew), as well as the data_health service (which makes sure you have enough disk space in the mon data dir), are only run when you have a quorum. If a message is received by a monitor not in a quorum, regardless of state, it will be dropped.

Assuming you now took one of the self-appointed leaders out - by shutting it down, for instance - you should now check what's causing elections not to hold.

In these cases, assuming your 3 monitors do form a quorum, the traditional issue tends to be 'lease timeouts'. I.e., the leader fails to provide a lease extension on paxos for the peons, and the peons assume the leader failed in some form (unresponsive, down, whatever).

Above it does seem a lease timeout was triggered on a peon. This may have happened because:

1. the leader did not extend the lease
2. the leader did extend the lease but the lease was in the past - usually an indication of a clock skew on the leader, on the peons, or both.
3. the leader did extend the lease with the correct time, but the peon failed to dispatch the message on time.

Both 1. and 2. may be due to several factors, but most commonly it's because the monitor was stuck doing something. This something, more often than not, is leveldb. If this is the case, check the size of your leveldb. If it is over 5 or 6 GB in size, you may need to manually compact the store (mon compact on start = true, iirc).
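With the default layout the store sits inside the mon data dir, so its size can be checked with something like (adjust the path if your mon data dir is elsewhere):

# du -sh /var/lib/ceph/mon/*/store.db

and the compaction option goes in ceph.conf on the monitor hosts:

[mon]
mon compact on start = true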
HTH
-Joao
On Mon, Jul 25, 2016 at 4:10 PM, Joao Eduardo Luis <joao@xxxxxxx> wrote:
On 07/25/2016 03:41 PM, Sergio A. de Carvalho Jr. wrote:
In the logs, these 2 monitors are constantly reporting that they won the leader election:

60z0m02 (monitor 0):

2016-07-25 14:31:11.644335 7f8760af7700 0 log_channel(cluster) log [INF] : mon.60z0m02@0 won leader election with quorum 0,2,4
2016-07-25 14:31:44.521552 7f8760af7700 1 mon.60z0m02@0(leader).paxos(paxos recovering c 1318755..1319320) collect timeout, calling fresh election

60zxl02 (monitor 1):

2016-07-25 14:31:59.542346 7fefdeaed700 1 mon.60zxl02@1(electing).elector(11441) init, last seen epoch 11441
2016-07-25 14:32:04.583929 7fefdf4ee700 0 log_channel(cluster) log [INF] : mon.60zxl02@1 won leader election with quorum 1,2,4
2016-07-25 14:32:33.440103 7fefdf4ee700 1 mon.60zxl02@1(leader).paxos(paxos recovering c 1318755..1319319) collect timeout, calling fresh election
There are two likely scenarios to explain this:

1. The monitors have different monitors in their monmaps - this could happen if you didn't add the new monitor via 'ceph mon add'. You can check this by running 'ceph daemon mon.<ID> mon_status' for each of the monitors in the cluster.

2. Some of the monitors are unable to communicate with each other, thus will never acknowledge the same leader. This does not mean you have two leaders for the same cluster, but it does mean that you will end up having two monitors declaring victory and becoming the self-proclaimed leader in the cluster. The peons should still only belong to one quorum.

If this does not help you, try setting 'debug mon = 10' and 'debug ms = 1' on the monitors and check the logs, making sure the monitors get the probes and follow the election process. If you need further assistance, put those logs online somewhere we can access them and we'll try to help you out.
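Two quick checks, one per scenario above: compare the monmap each mon actually has, and confirm every mon can reach every other mon on the monitor port (the peer address below is just one of yours as an example; use nc, telnet, or whatever is handy):

# ceph daemon mon.<ID> mon_status | grep -A40 '"monmap"'
# nc -zv 10.98.2.167 6789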
-Joao
On Mon, Jul 25, 2016 at 3:27 PM, Sergio A. de Carvalho Jr. <scarvalhojr@xxxxxxxxx> wrote:
Hi,

I have a cluster of 5 hosts running Ceph 0.94.6 on CentOS 6.5. On each host, there is 1 monitor and 13 OSDs. We had an issue with the network and, for some reason I still don't know, the servers were restarted. One host is still down, but the monitors on the 4 remaining servers are failing to enter a quorum.

I managed to get a quorum of 3 monitors by stopping all Ceph monitors and OSDs across all machines, and bringing up the top 3 ranked monitors in order of rank. After a few minutes, the 60z0m02 monitor (the top ranked one) became the leader:
{
    "name": "60z0m02",
    "rank": 0,
    "state": "leader",
    "election_epoch": 11328,
    "quorum": [
        0,
        1,
        2
    ],
    "outside_quorum": [],
    "extra_probe_peers": [],
    "sync_provider": [],
    "monmap": {
        "epoch": 5,
        "fsid": "2f51a247-3155-4bcf-9aee-c6f6b2c5e2af",
        "modified": "2016-04-28 22:26:48.604393",
        "created": "0.000000",
        "mons": [
            {
                "rank": 0,
                "name": "60z0m02",
                "addr": "10.98.2.166:6789\/0"
            },
            {
                "rank": 1,
                "name": "60zxl02",
                "addr": "10.98.2.167:6789\/0"
            },
            {
                "rank": 2,
                "name": "610wl02",
                "addr": "10.98.2.173:6789\/0"
            },
            {
                "rank": 3,
                "name": "618yl02",
                "addr": "10.98.2.214:6789\/0"
            },
            {
                "rank": 4,
                "name": "615yl02",
                "addr": "10.98.2.216:6789\/0"
            }
        ]
    }
}
The other 2 monitors became peons:

"name": "60zxl02",
"rank": 1,
"state": "peon",
"election_epoch": 11328,
"quorum": [
    0,
    1,
    2
],

"name": "610wl02",
"rank": 2,
"state": "peon",
"election_epoch": 11328,
"quorum": [
    0,
    1,
    2
],
I then proceeded to start the fourth monitor, 615yl02 (618yl02 is powered off), but after more than 2 hours and several election rounds, the monitors still haven't reached a quorum. The monitors alternate mostly between "election" and "probing" states, but they often seem to be in different election epochs.

Is this normal? Is there anything I can do to help the monitors elect a leader? Should I manually remove the dead host's monitor from the monitor map?

I deliberately left all OSD daemons stopped while the election is going on. Is this the best thing to do? Would bringing the OSDs up help or complicate matters even more? Or doesn't it make any difference?
I don't see anything obviously wrong in the monitor logs. They're mostly filled with messages like the following:

2016-07-25 14:17:57.806148 7fc1b3f7e700 1 mon.610wl02@2(electing).elector(11411) init, last seen epoch 11411
2016-07-25 14:17:57.829198 7fc1b7caf700 0 log_channel(audit) log [DBG] : from='admin socket' entity='admin socket' cmd='mon_status' args=[]: dispatch
2016-07-25 14:17:57.829200 7fc1b7caf700 0 log_channel(audit) do_log log to syslog
2016-07-25 14:17:57.829254 7fc1b7caf700 0 log_channel(audit) log [DBG] : from='admin socket' entity='admin socket' cmd=mon_status args=[]: finished
Any help would be hugely appreciated.

Thanks,
Sergio
Sean Crosby
Research Computing
ARC Centre of Excellence for Particle Physics at the Terascale
School of Physics | University of Melbourne VIC 3010 Australia
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com