Diagnosing slow requests

Peter Kerdisle <peter.kerdisle@xxxxxxxxx> · Thu, 5 May 2016 10:30:03 +0200

Hey guys,
I'm running into an issue with my cluster during high activity. 

I have two SSD cache servers (2 SSDs for journals, 7 SSDs for data) with 2x10Gbit bonded each and a six OSD nodes with a 10Gbit public and 10Gbit cluster network for the erasure pool (10x3TB without separate journal). This is all on Jewel.

It's working fine with normal load. However when I force increased activity by lowering the cache_target_dirty_ratio to make sure my files are promoted things start to go amiss. 

To give an example: http://pastie.org/private/5k5ml6a8gqkivjshgjcedg

This is especially concering:  pgs: 9 activating+undersized+degraded, 48 active+undersized+degraded, 1 stale+active+clean, 27 peering.

Here is an other minute or so where I grepped for warnings: http://pastie.org/private/bfv3kxl63cfcduafoaurog

These warnings are generated all over the OSD nodes, not specifically one OSD or even one node.

During this time the different OSD logs show various warnings:

2016-05-05 10:04:10.873603 7f1afaf3b700  1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f1af3f2d700' had timed out after 15
2016-05-05 10:04:10.873605 7f1afaf3b700  1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f1af5f31700' had timed out after 15
2016-05-05 10:04:10.905997 7f1afc73e700  1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f1af3f2d700' had timed out after 15
2016-05-05 10:04:10.906000 7f1afc73e700  1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f1af5f31700' had timed out after 15
2016-05-05 10:04:10.906022 7f1afaf3b700  1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f1af3f2d700' had timed out after 15
2016-05-05 10:04:10.906027 7f1afaf3b700  1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f1af5f31700' had timed out after 15
2016-05-05 10:04:10.949894 7f1af3f2d700  1 heartbeat_map reset_timeout 'OSD::osd_op_tp thread 0x7f1af3f2d700' had timed out after 15
2016-05-05 10:04:10.956801 7f1afc73e700  1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f1af5f31700' had timed out after 15
2016-05-05 10:04:10.956833 7f1afaf3b700  1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f1af5f31700' had timed out after 15
2016-05-05 10:04:10.957816 7f1af5f31700  1 heartbeat_map reset_timeout 'OSD::osd_op_tp thread 0x7f1af5f31700' had timed out after 15

or

2016-05-05 10:03:27.269658 7f98cca35700 -1 osd.6 7235 heartbeat_check: no reply from osd.25 since back 2016-05-05 10:03:06.566276 front 2016-05-05 10:03:06.566276 (cutoff 2016-05-05 10:03:07.269651)
2016-05-05 10:03:28.269838 7f98cca35700 -1 osd.6 7235 heartbeat_check: no reply from osd.25 since back 2016-05-05 10:03:06.566276 front 2016-05-05 10:03:06.566276 (cutoff 2016-05-05 10:03:08.269831)
2016-05-05 10:03:29.269998 7f98cca35700 -1 osd.6 7235 heartbeat_check: no reply from osd.25 since back 2016-05-05 10:03:06.566276 front 2016-05-05 10:03:06.566276 (cutoff 2016-05-05 10:03:09.269992)
2016-05-05 10:03:29.801145 7f98b339a700 -1 osd.6 7235 heartbeat_check: no reply from osd.25 since back 2016-05-05 10:03:06.566276 front 2016-05-05 10:03:06.566276 (cutoff 2016-05-05 10:03:09.801142)
2016-05-05 10:04:06.275237 7f98cca35700  0 log_channel(cluster) log [WRN] : 1 slow requests, 1 included below; oldest blocked for > 30.910385 secs
2016-05-05 10:04:06.275252 7f98cca35700  0 log_channel(cluster) log [WRN] : slow request 30.910385 seconds old, received at 2016-05-05 10:03:35.364796: osd_op(osd.70.6640:6555588 4.1041c254 rbd_data.afd7564682858.0000000000014daa [copy-from ver 105993] snapc 0=[] ondisk+write+ignore_overlay+enforce_snapc+known_if_redirected e7235) currently commit_sent

I've followed the instructions on http://docs.ceph.com/docs/master/rados/troubleshooting/troubleshooting-osd/ to hopefully find out what's happening but as far as the hardware is concerned everything looks fine. No smart errors logged, iostats shows some activity but nothing pegged to 100%, no messages in dmesg and the cpus are only used for around 25% max. 

This makes me think it might be network related. I have a 10Gbit public and 10Gbit cluster network neither of which seems to hit any limits. There is one thing that might be a problem and that is that one of the cache node has a bonded interface and no access to the cluster network, and the other cache node has a public and cluster interface.

Could anybody give me some more steps I can take to further discover where this bottleneck lies?

Thanks in advance,

Peter
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com