Hey,
Sadly I'm still battling this issue. I did notice one interesting thing.
I changed the cache settings for my cache tier to add redundancy to the pool, which caused a lot of recovery activity on the cache. During all of this there were absolutely no slow requests reported. Can I conclude anything from that? Is it possible that not having SSDs for journals is the bottleneck on my erasure pool, and that this is what's generating the slow requests?
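One way to test the journal theory, I think, is to compare per-OSD commit latency between the cache tier and the erasure pool; `ceph osd perf` reports this per OSD. A minimal sketch (the OSD numbers and latency values below are invented for illustration, not from my cluster):

```shell
# On a monitor node, `ceph osd perf` lists per-OSD journal/commit latencies.
# Consistently high fs_commit_latency on the journal-less erasure-pool OSDs,
# but not on the cache OSDs, would point at the missing SSD journals.
#   ceph osd perf
# Offline illustration against captured sample output (values invented):
sample='osd fs_commit_latency(ms) fs_apply_latency(ms)
6 812 845
25 14 16'
echo "$sample" | awk 'NR > 1 && $2 > 100 {print "osd." $1 " commit latency: " $2 "ms"}'
```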
I simply can't imagine why a request can be blocked for 30 or even 60 seconds. It's getting really frustrating not being able to fix this and I simply don't know what else I can do at this point.
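For reference, this is roughly how I've been trying to see where a blocked request actually sits (the OSD id and log path below are placeholders, not real values from my cluster):

```shell
# On an OSD node, dump the slowest recent ops with their per-step timestamps
# to see which stage a blocked request is stuck in (OSD id is a placeholder):
#   ceph daemon osd.6 dump_historic_ops
# Counting heartbeat_map timeouts per op thread shows whether one thread
# stays stuck. Real usage (log path is a placeholder):
#   grep 'had timed out' /var/log/ceph/ceph-osd.6.log | grep -o 'thread 0x[0-9a-f]*' | sort | uniq -c
# Offline illustration against a captured log excerpt:
excerpt="heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f1af3f2d700' had timed out after 15
heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f1af5f31700' had timed out after 15
heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f1af5f31700' had timed out after 15"
echo "$excerpt" | grep -o 'thread 0x[0-9a-f]*' | sort | uniq -c | sort -rn
```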
If anybody has any suggestions for something I haven't tried yet, please let me know.
Peter
On Thu, May 5, 2016 at 10:30 AM, Peter Kerdisle <peter.kerdisle@xxxxxxxxx> wrote:
Hey guys,

I'm running into an issue with my cluster during high activity.

I have two SSD cache servers (2 SSDs for journals, 7 SSDs for data), each with 2x10Gbit bonded, and six OSD nodes with a 10Gbit public and a 10Gbit cluster network for the erasure pool (10x3TB without separate journals). This is all on Jewel.

It works fine under normal load. However, when I force increased activity by lowering cache_target_dirty_ratio to make sure my files are promoted, things start to go amiss.

To give an example: http://pastie.org/private/5k5ml6a8gqkivjshgjcedg

This is especially concerning: pgs: 9 activating+undersized+degraded, 48 active+undersized+degraded, 1 stale+active+clean, 27 peering.

Here is another minute or so where I grepped for warnings: http://pastie.org/private/bfv3kxl63cfcduafoauro g

These warnings are generated all over the OSD nodes, not from one specific OSD or even one node.

During this time the different OSD logs show various warnings:

2016-05-05 10:04:10.873603 7f1afaf3b700 1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f1af3f2d700' had timed out after 15
2016-05-05 10:04:10.873605 7f1afaf3b700 1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f1af5f31700' had timed out after 15
2016-05-05 10:04:10.905997 7f1afc73e700 1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f1af3f2d700' had timed out after 15
2016-05-05 10:04:10.906000 7f1afc73e700 1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f1af5f31700' had timed out after 15
2016-05-05 10:04:10.906022 7f1afaf3b700 1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f1af3f2d700' had timed out after 15
2016-05-05 10:04:10.906027 7f1afaf3b700 1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f1af5f31700' had timed out after 15
2016-05-05 10:04:10.949894 7f1af3f2d700 1 heartbeat_map reset_timeout 'OSD::osd_op_tp thread 0x7f1af3f2d700' had timed out after 15
2016-05-05 10:04:10.956801 7f1afc73e700 1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f1af5f31700' had timed out after 15
2016-05-05 10:04:10.956833 7f1afaf3b700 1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f1af5f31700' had timed out after 15
2016-05-05 10:04:10.957816 7f1af5f31700 1 heartbeat_map reset_timeout 'OSD::osd_op_tp thread 0x7f1af5f31700' had timed out after 15

or

2016-05-05 10:03:27.269658 7f98cca35700 -1 osd.6 7235 heartbeat_check: no reply from osd.25 since back 2016-05-05 10:03:06.566276 front 2016-05-05 10:03:06.566276 (cutoff 2016-05-05 10:03:07.269651)
2016-05-05 10:03:28.269838 7f98cca35700 -1 osd.6 7235 heartbeat_check: no reply from osd.25 since back 2016-05-05 10:03:06.566276 front 2016-05-05 10:03:06.566276 (cutoff 2016-05-05 10:03:08.269831)
2016-05-05 10:03:29.269998 7f98cca35700 -1 osd.6 7235 heartbeat_check: no reply from osd.25 since back 2016-05-05 10:03:06.566276 front 2016-05-05 10:03:06.566276 (cutoff 2016-05-05 10:03:09.269992)
2016-05-05 10:03:29.801145 7f98b339a700 -1 osd.6 7235 heartbeat_check: no reply from osd.25 since back 2016-05-05 10:03:06.566276 front 2016-05-05 10:03:06.566276 (cutoff 2016-05-05 10:03:09.801142)
2016-05-05 10:04:06.275237 7f98cca35700 0 log_channel(cluster) log [WRN] : 1 slow requests, 1 included below; oldest blocked for > 30.910385 secs
2016-05-05 10:04:06.275252 7f98cca35700 0 log_channel(cluster) log [WRN] : slow request 30.910385 seconds old, received at 2016-05-05 10:03:35.364796: osd_op(osd.70.6640:6555588 4.1041c254 rbd_data.afd7564682858.0000000000014daa [copy-from ver 105993] snapc 0=[] ondisk+write+ignore_overlay+enforce_snapc+known_if_redirected e7235) currently commit_sent

I've followed the instructions on http://docs.ceph.com/docs/master/rados/troubleshooting/troubleshooting-osd/ to hopefully find out what's happening, but as far as the hardware is concerned everything looks fine: no SMART errors logged, iostat shows some activity but nothing pegged at 100%, no messages in dmesg, and the CPUs peak at around 25% usage.

This makes me think it might be network related.
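A sketch of what I'd check to confirm or rule that out (the interface name and peer address below are placeholders):

```shell
# Per-interface error/drop counters on each node; a flapping or saturated
# bond usually shows up here first (interface name is a placeholder):
#   ip -s link show bond0
# Raw throughput over the cluster network between two nodes:
#   iperf3 -s                 # on one OSD node
#   iperf3 -c 10.0.0.2 -P 4   # from a cache node (placeholder peer address)
# Offline illustration: flag interfaces with RX drops from captured counters.
counters='bond0 0
eth2 1524'
echo "$counters" | awk '$2 > 0 {print $1 " has " $2 " rx drops"}'
```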
I have a 10Gbit public and a 10Gbit cluster network, and neither seems to be hitting any limits. One thing that might be a problem: one of the cache nodes has a bonded interface with no access to the cluster network, while the other cache node has both a public and a cluster interface.

Could anybody give me some more steps I can take to further narrow down where this bottleneck lies?

Thanks in advance,
Peter
_______________________________________________ ceph-users mailing list ceph-users@xxxxxxxxxxxxxx http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com