Hi,
On 24 May 2016, at 09:16, Christian Balzer <chibi@xxxxxxx> wrote:

Hello,

On Tue, 24 May 2016 07:03:25 +0000 Josef Johansson wrote:

Hi,
You need to monitor latency rather than just peak throughput. With 3 replicas Ceph writes to two other nodes as well, which adds roughly 4x the latency of a single round trip from the client to the primary OSD. So smaller and more numerous IOs mean more pain in latency.
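A minimal sketch of what I mean by watching latency, assuming the ceph CLI is on the path and that your release exposes per-OSD commit/apply latency via `ceph osd perf` in JSON (the "osd_perf_infos" key layout below is what I would expect on Jewel; adjust if your output differs):

import json
import subprocess

def osd_latencies():
    """Return {osd_id: (commit_latency_ms, apply_latency_ms)} from `ceph osd perf`."""
    out = subprocess.check_output(["ceph", "osd", "perf", "--format", "json"])
    data = json.loads(out.decode())
    # Assumed key layout: {"osd_perf_infos": [{"id": 0, "perf_stats": {...}}, ...]}
    stats = {}
    for entry in data.get("osd_perf_infos", []):
        perf = entry["perf_stats"]
        stats[entry["id"]] = (perf["commit_latency_ms"], perf["apply_latency_ms"])
    return stats

if __name__ == "__main__":
    # Worst offenders first; a single OSD standing out is usually the whole story.
    for osd, (commit_ms, apply_ms) in sorted(osd_latencies().items(),
                                             key=lambda kv: -max(kv[1])):
        print("osd.%d  commit=%sms  apply=%sms" % (osd, commit_ms, apply_ms))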
While very true, I don't think latency (as in network/Ceph code related) is causing his problems.

30+ second slow requests tend to be nearly exclusively the domain of the I/O system being unable to keep up. And I'm pretty sure it's the HDD based EC pool.

Not just monitoring Ceph counters but the actual disk stats can be helpful, keeping in mind that it takes only one slow OSD, node, or wonky link to bring everything to a standstill.
I agree, more counters to identify the problem. Is there any way nowadays to automatically find out which OSD is causing the slow requests? Ideally a graphical way of doing it :)
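Not graphical, but as a sketch you can scrape `ceph health detail`, which names the OSDs the blocked ops are sitting on (the line format in the regex is what I recall from Jewel-era health output; adjust it if yours differs):

import re
import subprocess
from collections import Counter

def blocked_ops_per_osd():
    """Tally blocked ops per OSD from `ceph health detail` output."""
    out = subprocess.check_output(["ceph", "health", "detail"]).decode()
    counts = Counter()
    # Expected lines look roughly like:
    #   1 ops are blocked > 32.768 sec on osd.25
    for m in re.finditer(r"(\d+) ops are blocked > [\d.]+ sec on (osd\.\d+)", out):
        counts[m.group(2)] += int(m.group(1))
    return counts

if __name__ == "__main__":
    for osd, n in blocked_ops_per_osd().most_common():
        print("%s: %d blocked ops" % (osd, n))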
So another Subject line for that issue. That explains it.
Regards,
Josef
On Tue, 24 May 2016, 08:49 Peter Kerdisle, <peter.kerdisle@xxxxxxxxx> wrote:
Hey Christian,
I honestly haven't seen any replies to my earlier message. I will go back through my email and make sure I find it; my apologies.
I am graphing everything with collectd and graphite, which is what makes this so frustrating: I am not seeing any obvious pain points.
I am basically using the pool in read-forward mode now, so there should be almost no promotion from the EC pool to the SSD pool. I will see what options I have for adding some SSD journals to the OSD nodes to help speed things along.
Thanks, and apologies again for missing your earlier replies.
Peter
On Tue, May 24, 2016 at 4:25 AM, Christian Balzer <chibi@xxxxxxx> wrote:
Hello,
On Mon, 23 May 2016 10:45:41 +0200 Peter Kerdisle wrote:
Hey,
Sadly I'm still battling this issue. I did notice one interesting thing.
I changed the cache settings for my cache tier to add redundancy to the pool, which meant a lot of recovery activity on the cache. During all this there were absolutely no slow requests reported. Is there anything I can conclude from that? Is it possible that not having any SSDs for journals is the bottleneck on my erasure pool, and that's what is generating the slow requests?
That's what I suggested in the first of my 3 replies to your original thread which seemingly got ignored judging by the lack of a reply.
I simply can't imagine why a request can be blocked for 30 or even 60 seconds. It's getting really frustrating not being able to fix this, and I simply don't know what else I can do at this point.
Are you monitoring your cluster, especially the HDD nodes? Permanently with collectd and graphite (or similar) and topically (especially during tests) with atop (or iostat, etc)?
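If atop/iostat aren't to hand, a rough sketch of the same per-disk utilisation figure straight from /proc/diskstats (field 13 is milliseconds spent doing I/O, per the standard kernel layout):

import time

def io_ticks():
    """Return {device: ms spent doing I/O} from /proc/diskstats."""
    ticks = {}
    with open("/proc/diskstats") as f:
        for line in f:
            fields = line.split()
            ticks[fields[2]] = int(fields[12])  # field 13: time spent doing I/Os (ms)
    return ticks

interval = 5.0
before = io_ticks()
time.sleep(interval)
after = io_ticks()

# %util is busy time over wall time, the same number iostat -x reports;
# one disk sitting near 100% while the rest idle is the classic slow-OSD pattern.
for dev in sorted(after):
    util = 100.0 * (after[dev] - before.get(dev, 0)) / (interval * 1000.0)
    if util > 1.0:
        print("%-10s %5.1f%% util" % (dev, util))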
And one more time, your cache tier can only help you if it is fast and large enough to sufficiently dis-engage you from your slow backing storage. And let's face it, an EC pool AND no journal SSDs will be quite slow.
So if your cache is dirty all the time and has to flush (causing IO on the backing storage) while there is also promotion from the backing storage to the cache going on you're basically down to the base speed of your EC pool, at least for some of your ops.
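To see how close you are to that situation, dumping the tiering knobs on the cache pool is a start; a minimal sketch (the pool name "cache" is a placeholder for your cache tier pool, the option names are the standard cache-tier pool settings):

import json
import subprocess

CACHE_POOL = "cache"  # placeholder: substitute your cache tier pool name

def pool_get(pool, key):
    """Read a single pool option via `ceph osd pool get ... --format json`."""
    out = subprocess.check_output(
        ["ceph", "osd", "pool", "get", pool, key, "--format", "json"])
    return json.loads(out.decode()).get(key)

# The knobs that decide when flushing and eviction kick in; if
# cache_target_dirty_ratio is low and target_max_bytes is small relative
# to the working set, the cache will be flushing more or less constantly.
for key in ("target_max_bytes", "target_max_objects",
            "cache_target_dirty_ratio", "cache_target_full_ratio"):
    print("%s = %s" % (key, pool_get(CACHE_POOL, key)))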
Christian
If anybody has anything I haven't tried before please let me know.
Peter
On Thu, May 5, 2016 at 10:30 AM, Peter Kerdisle <peter.kerdisle@xxxxxxxxx> wrote:
Hey guys,
I'm running into an issue with my cluster during high activity.
I have two SSD cache servers (2 SSDs for journals, 7 SSDs for data), each with 2x10Gbit bonded, and six OSD nodes with a 10Gbit public and 10Gbit cluster network for the erasure pool (10x3TB without separate journals). This is all on Jewel.
It's working fine with normal load. However, when I force increased activity by lowering the cache_target_dirty_ratio to make sure my files are promoted, things start to go amiss.
To give an example: http://pastie.org/private/5k5ml6a8gqkivjshgjcedg
This is especially concerning: pgs: 9 activating+undersized+degraded, 48 active+undersized+degraded, 1 stale+active+clean, 27 peering.
Here is another minute or so where I grepped for warnings: http://pastie.org/private/bfv3kxl63cfcduafoaurog
These warnings are generated all over the OSD nodes, not specifically one OSD or even one node.
During this time the different OSD logs show various warnings:
2016-05-05 10:04:10.873603 7f1afaf3b700  1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f1af3f2d700' had timed out after 15
2016-05-05 10:04:10.873605 7f1afaf3b700  1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f1af5f31700' had timed out after 15
2016-05-05 10:04:10.905997 7f1afc73e700  1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f1af3f2d700' had timed out after 15
2016-05-05 10:04:10.906000 7f1afc73e700  1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f1af5f31700' had timed out after 15
2016-05-05 10:04:10.906022 7f1afaf3b700  1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f1af3f2d700' had timed out after 15
2016-05-05 10:04:10.906027 7f1afaf3b700  1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f1af5f31700' had timed out after 15
2016-05-05 10:04:10.949894 7f1af3f2d700  1 heartbeat_map reset_timeout 'OSD::osd_op_tp thread 0x7f1af3f2d700' had timed out after 15
2016-05-05 10:04:10.956801 7f1afc73e700  1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f1af5f31700' had timed out after 15
2016-05-05 10:04:10.956833 7f1afaf3b700  1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f1af5f31700' had timed out after 15
2016-05-05 10:04:10.957816 7f1af5f31700  1 heartbeat_map reset_timeout 'OSD::osd_op_tp thread 0x7f1af5f31700' had timed out after 15
or
2016-05-05 10:03:27.269658 7f98cca35700 -1 osd.6 7235 heartbeat_check: no reply from osd.25 since back 2016-05-05 10:03:06.566276 front 2016-05-05 10:03:06.566276 (cutoff 2016-05-05 10:03:07.269651)
2016-05-05 10:03:28.269838 7f98cca35700 -1 osd.6 7235 heartbeat_check: no reply from osd.25 since back 2016-05-05 10:03:06.566276 front 2016-05-05 10:03:06.566276 (cutoff 2016-05-05 10:03:08.269831)
2016-05-05 10:03:29.269998 7f98cca35700 -1 osd.6 7235 heartbeat_check: no reply from osd.25 since back 2016-05-05 10:03:06.566276 front 2016-05-05 10:03:06.566276 (cutoff 2016-05-05 10:03:09.269992)
2016-05-05 10:03:29.801145 7f98b339a700 -1 osd.6 7235 heartbeat_check: no reply from osd.25 since back 2016-05-05 10:03:06.566276 front 2016-05-05 10:03:06.566276 (cutoff 2016-05-05 10:03:09.801142)
2016-05-05 10:04:06.275237 7f98cca35700  0 log_channel(cluster) log [WRN] : 1 slow requests, 1 included below; oldest blocked for > 30.910385 secs
2016-05-05 10:04:06.275252 7f98cca35700  0 log_channel(cluster) log [WRN] : slow request 30.910385 seconds old, received at 2016-05-05 10:03:35.364796: osd_op(osd.70.6640:6555588 4.1041c254 rbd_data.afd7564682858.0000000000014daa [copy-from ver 105993] snapc 0=[] ondisk+write+ignore_overlay+enforce_snapc+known_if_redirected e7235) currently commit_sent
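In case it helps anyone chasing the same thing: a quick sketch that tallies these messages per OSD across the logs, to see whether a single peer keeps coming up (log path and message wording are assumed from the lines above):

import glob
import re
from collections import Counter

no_reply = Counter()       # peer OSDs we time out waiting for
slow_requests = Counter()  # which OSD's log carries the slow request warnings

# Assumed default log location; adjust for your deployment.
for path in glob.glob("/var/log/ceph/ceph-osd.*.log"):
    with open(path, errors="replace") as f:
        for line in f:
            m = re.search(r"heartbeat_check: no reply from (osd\.\d+)", line)
            if m:
                no_reply[m.group(1)] += 1
            if "slow request" in line:
                slow_requests[path] += 1

print("peers that stop answering heartbeats:", no_reply.most_common(5))
print("logs with the most slow requests:", slow_requests.most_common(5))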
I've followed the instructions on
http://docs.ceph.com/docs/master/rados/troubleshooting/troubleshooting-osd/
to hopefully find out what's happening, but as far as the hardware is concerned everything looks fine. No SMART errors logged, iostat shows some activity but nothing pegged at 100%, no messages in dmesg, and the CPUs are only at around 25% utilization at most.
This makes me think it might be network related. I have a 10Gbit public and a 10Gbit cluster network, neither of which seems to hit any limits. There is one thing that might be a problem: one of the cache nodes has a bonded interface and no access to the cluster network, while the other cache node has separate public and cluster interfaces.
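One cheap sanity check on the network theory is the per-interface error/drop counters, which would show trouble long before any 10Gbit ceiling; a minimal sketch reading the standard sysfs statistics on each node:

import glob
import os

# Non-zero and growing rx/tx errors or drops on the public or cluster NICs
# (or on the bond members) would point at the network rather than the disks.
for stats_dir in glob.glob("/sys/class/net/*/statistics"):
    iface = stats_dir.split(os.sep)[-2]
    counters = {}
    for name in ("rx_errors", "tx_errors", "rx_dropped", "tx_dropped"):
        with open(os.path.join(stats_dir, name)) as f:
            counters[name] = int(f.read().strip())
    if any(counters.values()):
        print(iface, counters)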
Could anybody give me some more steps I can take to further discover where this bottleneck lies?
Thanks in advance,
Peter
--
Christian Balzer        Network/Systems Engineer
chibi@xxxxxxx           Global OnLine Japan/Rakuten Communications
http://www.gol.com/
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
--
Christian Balzer        Network/Systems Engineer
chibi@xxxxxxx           Global OnLine Japan/Rakuten Communications
http://www.gol.com/