But isn't that already the default? (on CentOS 7 RPMs)

[@c03 ~]# cat /etc/sysconfig/ceph
# /etc/sysconfig/ceph
#
# Environment file for ceph daemon systemd unit files.
#
# Increase tcmalloc cache size
TCMALLOC_MAX_TOTAL_THREAD_CACHE_BYTES=134217728

-----Original Message-----
From: John Petrini [mailto:jpetrini@xxxxxxxxxxxx]
Sent: Saturday, 17 February 2018 1:06
To: David Turner
Cc: ceph-users
Subject: Re: High Load and High Apply Latency

I thought I'd follow up on this just in case anyone else experiences similar issues.

We ended up increasing the tcmalloc thread cache size and saw a huge improvement in latency. This got us out of the woods: we were finally in a state where performance was good enough that it was no longer impacting services. The tcmalloc issues are pretty well documented on this mailing list, and I don't believe they affect newer versions of Ceph, but I thought I'd at least add a data point. After making this change, our average apply latency dropped to 3.46ms during peak business hours. To give you an idea of how significant that is, here's a graph of the apply latency prior to the change: https://imgur.com/KYUETvD

This did not resolve all of our issues, however. We were still seeing high iowait (repeated spikes up to 400ms) on all disks of three of our OSD nodes. We tried replacing the RAID controller (PERC H730) on these nodes; while this resolved the issue on one server, the other two remained problematic.

These two nodes were configured differently from the rest: their disks had been set up in non-RAID mode, while the others were configured as individual RAID-0 volumes. This turned out to be the problem. We removed the two nodes one at a time and rebuilt them with their disks configured as independent RAID-0 instead of non-RAID. After this change, iowait rarely spikes above 15ms and averages <1ms.

I was really surprised at the performance impact of non-RAID mode. While I realize non-RAID bypasses the controller cache, I still would never have expected such high latency. Dell has a whitepaper that recommends using individual RAID-0, but their own tests show only a small performance advantage over non-RAID. Note that we are running SAS disks; they actually recommend non-RAID mode for SATA, but I have not tested this. You can view the whitepaper here:
http://en.community.dell.com/techcenter/cloud/m/dell_cloud_resources/20442913/download

I hope this helps someone.

John Petrini
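
For anyone who wants to try the tcmalloc change John describes (on packages where the sysconfig entry quoted at the top is not already present), a minimal sketch for a CentOS 7 / systemd host; the OSD id is an example, and OSDs should be restarted one at a time:

    # /etc/sysconfig/ceph is sourced by the Ceph systemd unit files
    echo 'TCMALLOC_MAX_TOTAL_THREAD_CACHE_BYTES=134217728' >> /etc/sysconfig/ceph

    # restart one OSD, wait for the cluster to settle, then move to the next
    systemctl restart ceph-osd@0.service
    ceph -s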
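Apply latency can be watched per OSD before and after such a change with ceph osd perf; the sample output below is illustrative, and on FileStore-era releases the columns are named fs_commit_latency(ms) and fs_apply_latency(ms):

    ceph osd perf
    # osd  fs_commit_latency(ms)  fs_apply_latency(ms)
    #   0                      2                     3
    #   7                      5                   412   <- outlier worth chasing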
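The iowait spikes John mentions show up per device in the extended iostat output (from the sysstat package); watch the await and %util columns while the node is under load:

    iostat -x 5
    # Device:  r/s   w/s  ...  await  %util
    # sdb      ...        ...  400.2   99.8   <- await is per-request latency in ms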
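The non-RAID to single-disk RAID-0 conversion can also be done from the controller CLI rather than by rebuilding the node, though rebuilding as John did is the safer route, since taking a disk out of non-RAID mode destroys its data. A sketch with Dell's perccli; the syntax here is from memory and should be checked against the H730 documentation, and the controller/enclosure/slot IDs (/c0, 32:0) are placeholders; drain the OSD off the disk first:

    perccli /c0 show                             # list enclosure:slot IDs
    perccli /c0/e32/s0 set good force            # leave non-RAID mode; destroys data on the disk
    perccli /c0 add vd type=raid0 drives=32:0    # one single-disk RAID-0 per device
    perccli /c0/v0 set wrcache=wb                # controller write-back cache (needs a healthy BBU)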