Re: Diagnosing slow requests

Hi,

On 24 May 2016, at 09:16, Christian Balzer <chibi@xxxxxxx> wrote:


Hello,

On Tue, 24 May 2016 07:03:25 +0000 Josef Johansson wrote:

Hi,

You need to monitor latency rather than just peak points. With 3 replicas Ceph
writes to two additional nodes, so you end up with roughly 4x the latency of a
single round trip from the client to the primary OSD. Smaller and more numerous
I/Os therefore mean more pain in terms of latency.
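
If you want actual numbers for that split, the per-OSD perf counters should expose it; a rough check could look something like this (osd.0 is just a placeholder, run on the host carrying that OSD):

    # client-facing write latency vs. latency of the replication sub-ops
    ceph daemon osd.0 perf dump | grep -A 2 -E 'op_w_latency|subop_w_latency'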

While very true, I don't think latency (as in network or Ceph code related)
is causing his problems.

30+ second slow requests tend to be nearly exclusively the domain of the
I/O system being unable to keep up.
And I'm pretty sure it's the HDD based EC pool.

Monitoring not just the Ceph counters but also the actual disk stats can be
helpful, keeping in mind that it takes only one slow OSD, node or wonky
link to bring everything to a standstill.
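
Something as simple as the following on each HDD node usually exposes the one device that is struggling (example invocation only):

    # one disk with high await/%util while the rest are idle is the usual tell
    iostat -xmt 5
    # or interactively
    atop 5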


I agree, more counters to identify the problem. Is there any way nowadays to automatically find out which OSD is causing the slow requests? Ideally a graphical way of doing it :)
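
The closest thing I know of (not graphical, admittedly) is grepping the health output, which at least names the OSDs that currently have blocked requests; the osd number below is just an example of the kind of line it prints:

    ceph health detail | grep -i blocked
    # e.g. "100 ops are blocked > 32.768 sec on osd.14"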

And the worst thing is that, AFAIK, there is nothing that actually shows this
directly; ceph osd perf shows some latency. A slow CPU could hamper
performance even while showing no sign of it.

I believe you can see which operations are running right now and what
they are waiting for; I think there's a thread on this ML regarding a
deadlock and slow requests that could be interesting.
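
If I remember correctly that is exposed through the OSD admin socket, something like this (osd.14 is a placeholder):

    # ops currently in flight on this OSD, including the step they are waiting on
    ceph daemon osd.14 dump_ops_in_flight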

I did not see Christian's responses either, so maybe it's not a problem with
your client.

The list (archive) and I sure did:

http://lists.ceph.com/pipermail/ceph-users-ceph.com/2016-May/009464.html
http://lists.ceph.com/pipermail/ceph-users-ceph.com/2016-May/009744.html
http://lists.ceph.com/pipermail/ceph-users-ceph.com/2016-May/009747.html


So another Subject line for that issue. That explains it.

Regards,
Josef

Christian

Regards,
Josef

On Tue, 24 May 2016, 08:49 Peter Kerdisle, <peter.kerdisle@xxxxxxxxx>
wrote:

Hey Christian,

I honestly haven't seen any replies to my earlier message. I will go back
through my email and make sure I find them; my apologies.

I am graphing everything with collectd and graphite; this is what makes
it so frustrating, since I am not seeing any obvious pain points.

I am basically using the pool in read-forward mode now, so there should be
almost no promotion from the EC pool to the SSD pool. I will see what options
I have for adding some SSD journals to the OSD nodes to help speed
things along.
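
For what it's worth, this is roughly how I'm checking and setting the mode (the pool name is a placeholder, and depending on the Ceph version the mode change may want a --yes-i-really-mean-it):

    # show the cache_mode currently set on the tier pool
    ceph osd dump | grep cache_mode
    # switch the tier to readforward
    ceph osd tier cache-mode ssd-cache readforward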

Thanks, and apologies again for missing your earlier replies.

Peter

On Tue, May 24, 2016 at 4:25 AM, Christian Balzer <chibi@xxxxxxx>
wrote:


Hello,

On Mon, 23 May 2016 10:45:41 +0200 Peter Kerdisle wrote:

Hey,

Sadly I'm still battling this issue. I did notice one interesting
thing.

I changed the cache settings for my cache tier to add redundancy to
the pool, which meant a lot of recovery activity on the cache. During
all this there were absolutely no slow requests reported. Is there
anything I can conclude from that information? Is it possible that
not having any SSDs for journals is the bottleneck on my
erasure pool, and that this is what's generating the slow requests?

That's what I suggested in the first of my 3 replies to your original
thread, which seemingly got ignored, judging by the lack of a reply.

I simply can't imagine why a request can be blocked for 30 or even
60 seconds. It's getting really frustrating not being able to fix this,
and I simply don't know what else I can do at this point.

Are you monitoring your cluster, especially the HDD nodes?
Permanently with collectd and graphite (or similar) and topically
(especially during tests) with atop (or iostat, etc)?

And one more time: your cache tier can only help you if it is fast and
large enough to sufficiently disengage you from your slow backing
storage.
And let's face it, an EC pool AND no journal SSDs will be quite slow.

So if your cache is dirty all the time and has to flush (causing IO
on the backing storage) while there is also promotion from the
backing storage to the cache going on, you're basically down to the
base speed of your EC pool, at least for some of your ops.
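
A quick way to see whether that is happening is to watch how dirty the cache pool actually is and what its flush targets are ("ssd-cache" is a placeholder name):

    # DIRTY column shows dirty objects per pool
    ceph df detail
    # current flush/eviction targets on the cache pool
    ceph osd pool get ssd-cache cache_target_dirty_ratio
    ceph osd pool get ssd-cache cache_target_full_ratio
    ceph osd pool get ssd-cache target_max_bytes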

Christian

If anybody has anything I haven't tried before please let me know.

Peter

On Thu, May 5, 2016 at 10:30 AM, Peter Kerdisle
<peter.kerdisle@xxxxxxxxx> wrote:

Hey guys,

I'm running into an issue with my cluster during high activity.

I have two SSD cache servers (2 SSDs for journals, 7 SSDs for
data), each with 2x10Gbit bonded, and six OSD nodes with a 10Gbit
public and a 10Gbit cluster network for the erasure pool (10x3TB
without separate journals). This is all on Jewel.

It's working fine under normal load. However, when I force increased
activity by lowering the cache_target_dirty_ratio to make sure my
files are promoted, things start to go amiss.
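
For reference, the knob I'm turning is simply this (pool name and value are examples):

    # force more aggressive flushing from the cache to the EC pool
    ceph osd pool set ssd-cache cache_target_dirty_ratio 0.2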

To give an example:
http://pastie.org/private/5k5ml6a8gqkivjshgjcedg

This is especially concerning: pgs: 9
activating+undersized+degraded, 48 active+undersized+degraded, 1
stale+active+clean, 27 peering.
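
In case it helps with diagnosis, this is how I've been listing and inspecting those PGs (the PG id below is a placeholder):

    ceph pg dump_stuck unclean
    ceph pg 4.fa query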

Here is another minute or so where I grepped for warnings:
http://pastie.org/private/bfv3kxl63cfcduafoaurog

These warnings are generated all over the OSD nodes, not
specifically one OSD or even one node.

During this time the different OSD logs show various warnings:

2016-05-05 10:04:10.873603 7f1afaf3b700  1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f1af3f2d700' had timed out after 15
2016-05-05 10:04:10.873605 7f1afaf3b700  1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f1af5f31700' had timed out after 15
2016-05-05 10:04:10.905997 7f1afc73e700  1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f1af3f2d700' had timed out after 15
2016-05-05 10:04:10.906000 7f1afc73e700  1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f1af5f31700' had timed out after 15
2016-05-05 10:04:10.906022 7f1afaf3b700  1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f1af3f2d700' had timed out after 15
2016-05-05 10:04:10.906027 7f1afaf3b700  1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f1af5f31700' had timed out after 15
2016-05-05 10:04:10.949894 7f1af3f2d700  1 heartbeat_map reset_timeout 'OSD::osd_op_tp thread 0x7f1af3f2d700' had timed out after 15
2016-05-05 10:04:10.956801 7f1afc73e700  1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f1af5f31700' had timed out after 15
2016-05-05 10:04:10.956833 7f1afaf3b700  1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f1af5f31700' had timed out after 15
2016-05-05 10:04:10.957816 7f1af5f31700  1 heartbeat_map reset_timeout 'OSD::osd_op_tp thread 0x7f1af5f31700' had timed out after 15

or

2016-05-05 10:03:27.269658 7f98cca35700 -1 osd.6 7235 heartbeat_check: no reply from osd.25 since back 2016-05-05 10:03:06.566276 front 2016-05-05 10:03:06.566276 (cutoff 2016-05-05 10:03:07.269651)
2016-05-05 10:03:28.269838 7f98cca35700 -1 osd.6 7235 heartbeat_check: no reply from osd.25 since back 2016-05-05 10:03:06.566276 front 2016-05-05 10:03:06.566276 (cutoff 2016-05-05 10:03:08.269831)
2016-05-05 10:03:29.269998 7f98cca35700 -1 osd.6 7235 heartbeat_check: no reply from osd.25 since back 2016-05-05 10:03:06.566276 front 2016-05-05 10:03:06.566276 (cutoff 2016-05-05 10:03:09.269992)
2016-05-05 10:03:29.801145 7f98b339a700 -1 osd.6 7235 heartbeat_check: no reply from osd.25 since back 2016-05-05 10:03:06.566276 front 2016-05-05 10:03:06.566276 (cutoff 2016-05-05 10:03:09.801142)
2016-05-05 10:04:06.275237 7f98cca35700  0 log_channel(cluster) log [WRN] : 1 slow requests, 1 included below; oldest blocked for > 30.910385 secs
2016-05-05 10:04:06.275252 7f98cca35700  0 log_channel(cluster) log [WRN] : slow request 30.910385 seconds old, received at 2016-05-05 10:03:35.364796: osd_op(osd.70.6640:6555588 4.1041c254 rbd_data.afd7564682858.0000000000014daa [copy-from ver 105993] snapc 0=[] ondisk+write+ignore_overlay+enforce_snapc+known_if_redirected e7235) currently commit_sent
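
As far as I understand, the "had timed out after 15" above corresponds to osd_op_thread_timeout (15s by default), i.e. the op thread was stuck at least that long, and the admin socket keeps the most recent slow ops with their per-step timestamps. This is what I've been looking at (osd.6 only because it logged the warnings):

    # confirm the timeout the heartbeat_map messages refer to
    ceph daemon osd.6 config get osd_op_thread_timeout
    # recent slow ops with their event timeline (queued_for_pg, commit_sent, ...)
    ceph daemon osd.6 dump_historic_ops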


I've followed the instructions on
http://docs.ceph.com/docs/master/rados/troubleshooting/troubleshooting-osd/
to hopefully find out what's happening, but as far as the hardware is
concerned everything looks fine: no SMART errors logged, iostat shows
some activity but nothing pegged at 100%, no messages in dmesg, and
the CPUs are only at around 25% max.
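
In case it matters, these are the kinds of checks I ran per disk (the device name is a placeholder):

    # SMART health and error counters for an OSD disk
    smartctl -a /dev/sdb | grep -iE 'reallocated|pending|uncorrect|error'
    # recent kernel messages
    dmesg | tail -n 50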

This makes me think it might be network related. I have a 10Gbit
public and a 10Gbit cluster network, neither of which seems to hit any
limits. One thing that might be a problem is that one of the cache
nodes has a bonded interface and no access to the cluster network,
while the other cache node has both a public and a cluster interface.
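
My plan, unless someone has a better idea, is to measure the raw links with iperf3 between the nodes and to double-check which networks each OSD is actually configured with (the osd id is a placeholder):

    # raw throughput test between two nodes
    iperf3 -s                  # on one node
    iperf3 -c <other-node>     # on the other
    # what the OSD thinks its networks are
    ceph daemon osd.6 config get public_network
    ceph daemon osd.6 config get cluster_network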

Could anybody give me some more steps I can take to further
discover where this bottleneck lies?

Thanks in advance,

Peter



--
Christian Balzer        Network/Systems Engineer
chibi@xxxxxxx           Global OnLine Japan/Rakuten Communications
http://www.gol.com/


_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



-- 
Christian Balzer        Network/Systems Engineer                
chibi@xxxxxxx    Global OnLine Japan/Rakuten Communications
http://www.gol.com/

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
