Re: High Load and High Apply Latency

But isn't that already the default? (on the CentOS 7 RPMs)

[@c03 ~]# cat /etc/sysconfig/ceph
# /etc/sysconfig/ceph
#
# Environment file for ceph daemon systemd unit files.
#

# Increase tcmalloc cache size
TCMALLOC_MAX_TOTAL_THREAD_CACHE_BYTES=134217728
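For what it's worth, that file only takes effect for daemons started 
through the systemd units that source it, so if it was edited after the 
OSDs came up it's worth confirming a running ceph-osd actually inherited 
the value. A rough check with standard tools, run as root on the OSD 
host (the OSD picked by pgrep is arbitrary):

[@c03 ~]# tr '\0' '\n' < /proc/$(pgrep -x ceph-osd | head -n1)/environ | grep TCMALLOC

If nothing comes back, the OSDs were probably started before the setting 
was in place and need a restart to pick it up.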
 



-----Original Message-----
From: John Petrini [mailto:jpetrini@xxxxxxxxxxxx] 
Sent: Saturday, 17 February 2018 1:06
To: David Turner
Cc: ceph-users
Subject: Re: High Load and High Apply Latency

I thought I'd follow up on this just in case anyone else experiences 
similar issues. We ended up increasing the tcmalloc thread cache size 
and saw a huge improvement in latency. That got us out of the woods: 
performance was finally good enough that it was no longer impacting 
services.

The tcmalloc issues are pretty well documented on this mailing list, and 
I don't believe they affect newer versions of Ceph, but I thought I'd at 
least add a data point. After making this change, our average apply 
latency dropped to 3.46ms during peak business hours. To give you an 
idea of how significant that is, here's a graph of the apply latency 
prior to the change: https://imgur.com/KYUETvD
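
In case it's useful to anyone wanting to watch the same metric on their 
own cluster, the per-OSD apply/commit latency behind that graph can be 
polled from the ceph CLI; a rough sketch (assumes an admin keyring on 
the node; column names and order vary a little between releases):

# per-OSD commit/apply latency in milliseconds
ceph osd perf

# or keep the worst offenders visible while the cluster is under load
watch -n 5 'ceph osd perf | sort -nk3 | tail -n 20'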


This did not, however, resolve all of our issues. We were still seeing 
high iowait (repeated spikes up to 400ms) on all disks of three of our 
OSD nodes. We tried replacing the RAID controller (PERC H730) on these 
nodes, and while this resolved the issue on one server, the other two 
remained problematic. These two nodes were configured differently from 
the rest: their disks were in non-RAID mode, while the others were 
configured as individual RAID-0. This turned out to be the problem. We 
ended up removing the two nodes one at a time and rebuilding them with 
their disks configured as independent RAID-0 instead of non-RAID. After 
this change, iowait rarely spikes above 15ms and averages <1ms.
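
If you want to check your own nodes for the same symptom, plain iostat 
from the sysstat package makes per-disk latency spikes like that easy to 
spot (a generic example, not necessarily the exact invocation we used):

# extended per-device statistics every 5 seconds, throughput in MB/s;
# watch the await / w_await columns (milliseconds per request)
iostat -xm 5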


I was really surprised at the performance impact of non-RAID mode. While 
I realize non-RAID bypasses the controller cache, I still would never 
have expected such high latency. Dell has a whitepaper that recommends 
using individual RAID-0, although their own tests show only a small 
performance advantage over non-RAID. Note that we are running SAS disks; 
Dell actually recommends non-RAID mode for SATA, but I have not tested 
this. You can view the whitepaper here: 
http://en.community.dell.com/techcenter/cloud/m/dell_cloud_resources/20442913/download


I hope this helps someone.


John Petrini



_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


