Hey guys,
Peter
I'm running into an issue with my cluster during high activity.
I have two SSD cache servers (2 SSDs for journals, 7 SSDs for data) with 2x10Gbit bonded each and a six OSD nodes with a 10Gbit public and 10Gbit cluster network for the erasure pool (10x3TB without separate journal). This is all on Jewel.
It's working fine with normal load. However when I force increased activity by lowering the cache_target_dirty_ratio to make sure my files are promoted things start to go amiss.
To give an example: http://pastie.org/private/5k5ml6a8gqkivjshgjcedg
This is especially concering: pgs: 9 activating+undersized+degraded, 48 active+undersized+degraded, 1 stale+active+clean, 27 peering.
Here is an other minute or so where I grepped for warnings: http://pastie.org/private/bfv3kxl63cfcduafoaurog
These warnings are generated all over the OSD nodes, not specifically one OSD or even one node.
During this time the different OSD logs show various warnings:
2016-05-05 10:04:10.873603 7f1afaf3b700 1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f1af3f2d700' had timed out after 15
2016-05-05 10:04:10.873605 7f1afaf3b700 1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f1af5f31700' had timed out after 15
2016-05-05 10:04:10.905997 7f1afc73e700 1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f1af3f2d700' had timed out after 15
2016-05-05 10:04:10.906000 7f1afc73e700 1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f1af5f31700' had timed out after 15
2016-05-05 10:04:10.906022 7f1afaf3b700 1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f1af3f2d700' had timed out after 15
2016-05-05 10:04:10.906027 7f1afaf3b700 1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f1af5f31700' had timed out after 15
2016-05-05 10:04:10.949894 7f1af3f2d700 1 heartbeat_map reset_timeout 'OSD::osd_op_tp thread 0x7f1af3f2d700' had timed out after 15
2016-05-05 10:04:10.956801 7f1afc73e700 1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f1af5f31700' had timed out after 15
2016-05-05 10:04:10.956833 7f1afaf3b700 1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f1af5f31700' had timed out after 15
2016-05-05 10:04:10.957816 7f1af5f31700 1 heartbeat_map reset_timeout 'OSD::osd_op_tp thread 0x7f1af5f31700' had timed out after 15
or
2016-05-05 10:03:27.269658 7f98cca35700 -1 osd.6 7235 heartbeat_check: no reply from osd.25 since back 2016-05-05 10:03:06.566276 front 2016-05-05 10:03:06.566276 (cutoff 2016-05-05 10:03:07.269651)
2016-05-05 10:03:28.269838 7f98cca35700 -1 osd.6 7235 heartbeat_check: no reply from osd.25 since back 2016-05-05 10:03:06.566276 front 2016-05-05 10:03:06.566276 (cutoff 2016-05-05 10:03:08.269831)
2016-05-05 10:03:29.269998 7f98cca35700 -1 osd.6 7235 heartbeat_check: no reply from osd.25 since back 2016-05-05 10:03:06.566276 front 2016-05-05 10:03:06.566276 (cutoff 2016-05-05 10:03:09.269992)
2016-05-05 10:03:29.801145 7f98b339a700 -1 osd.6 7235 heartbeat_check: no reply from osd.25 since back 2016-05-05 10:03:06.566276 front 2016-05-05 10:03:06.566276 (cutoff 2016-05-05 10:03:09.801142)
2016-05-05 10:04:06.275237 7f98cca35700 0 log_channel(cluster) log [WRN] : 1 slow requests, 1 included below; oldest blocked for > 30.910385 secs
2016-05-05 10:04:06.275252 7f98cca35700 0 log_channel(cluster) log [WRN] : slow request 30.910385 seconds old, received at 2016-05-05 10:03:35.364796: osd_op(osd.70.6640:6555588 4.1041c254 rbd_data.afd7564682858.0000000000014daa [copy-from ver 105993] snapc 0=[] ondisk+write+ignore_overlay+enforce_snapc+known_if_redirected e7235) currently commit_sent
I've followed the instructions on http://docs.ceph.com/docs/master/rados/troubleshooting/troubleshooting-osd/ to hopefully find out what's happening but as far as the hardware is concerned everything looks fine. No smart errors logged, iostats shows some activity but nothing pegged to 100%, no messages in dmesg and the cpus are only used for around 25% max.
This makes me think it might be network related. I have a 10Gbit public and 10Gbit cluster network neither of which seems to hit any limits. There is one thing that might be a problem and that is that one of the cache node has a bonded interface and no access to the cluster network, and the other cache node has a public and cluster interface.
Could anybody give me some more steps I can take to further discover where this bottleneck lies?
Thanks in advance,
Peter
_______________________________________________ ceph-users mailing list ceph-users@xxxxxxxxxxxxxx http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com