Hi John,
HDD cache policy has all caches enabled, WB and ADRA.
I am trying to squeeze extra performance from my test cluster too:
Dell R 620 with PERC 710, RAID0, 10 GB network.
Would you be willing to share your controller and kernel configuration?
For example, I am using BIOS profile 'Performance' with the following added to /etc/default/kernel:
intel_pstate=disable intel_idle.max_cstate=0 processor.max_cstate=0 idle=poll
and tuned profile throughput-performance
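
A quick way to confirm that the parameters above actually reached the running kernel is to compare them against the boot command line. This is only a minimal sketch: the function name missing_kernel_params is made up for illustration, and the optional file argument exists only so the check can be run against a copy of the command line instead of /proc/cmdline.

```shell
#!/bin/sh
# Hypothetical helper: print each expected parameter that is NOT present
# on the kernel command line. Reads /proc/cmdline unless another file is
# passed as the first argument.
missing_kernel_params() {
    cmdline_file="${1:-/proc/cmdline}"
    for param in intel_pstate=disable intel_idle.max_cstate=0 \
                 processor.max_cstate=0 idle=poll; do
        # Pad with spaces so only whole-token matches count.
        case " $(cat "$cmdline_file") " in
            *" $param "*) ;;            # present, nothing to report
            *) echo "$param" ;;         # missing from the command line
        esac
    done
}
```

Calling missing_kernel_params with no argument after a reboot prints nothing when all four parameters are active.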
All disks are configured with nr_requests=1024 and read_ahead_kb=4096.
SSDs use scheduler=noop while HDDs use deadline.
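
For reference, the block settings above can be applied per disk through sysfs roughly as follows. This is a sketch under assumptions: the function name apply_queue_settings and the SYSFS_ROOT override are hypothetical (the override only allows a dry run against a scratch directory), and the disk type is passed explicitly because a RAID0 virtual disk behind the controller may not report its media type reliably. On newer multi-queue kernels the scheduler names are none and mq-deadline rather than noop and deadline.

```shell
#!/bin/sh
# Hypothetical helper: apply_queue_settings DEV ssd|hdd
# Writes the scheduler, nr_requests and read_ahead_kb values quoted above
# into the device's sysfs queue directory.
apply_queue_settings() {
    dev="$1"
    kind="$2"
    # SYSFS_ROOT is an assumption for testing; real hosts use /sys/block.
    queue="${SYSFS_ROOT:-/sys/block}/$dev/queue"

    case "$kind" in
        ssd) sched=noop ;;
        hdd) sched=deadline ;;
        *) echo "usage: apply_queue_settings DEV ssd|hdd" >&2; return 1 ;;
    esac

    # Set the scheduler first: on some kernels switching it resets
    # nr_requests to its default.
    echo "$sched" > "$queue/scheduler" &&
    echo 1024 > "$queue/nr_requests" &&
    echo 4096 > "$queue/read_ahead_kb"
}
```

Run as root, for example: apply_queue_settings sdb hdd. These writes do not survive a reboot, so they would need to be reapplied at boot (a udev rule is one common way).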
Cache policy for SSD:
megacli -LDSetProp -WT -Immediate -L0 -a0
megacli -LDSetProp -NORA -Immediate -L0 -a0
megacli -LDSetProp -Direct -Immediate -L0 -a0
Many thanks
Steven
On 16 February 2018 at 19:06, John Petrini <jpetrini@xxxxxxxxxxxx> wrote:
I thought I'd follow up on this just in case anyone else experiences similar issues. We ended up increasing the tcmalloc thread cache size and saw a huge improvement in latency. This got us out of the woods because we were finally in a state where performance was good enough that it was no longer impacting services.

The tcmalloc issues are pretty well documented on this mailing list and I don't believe they impact newer versions of Ceph but I thought I'd at least give a data point. After making this change our average apply latency dropped to 3.46ms during peak business hours. To give you an idea of how significant that is here's a graph of the apply latency prior to the change: https://imgur.com/KYUETvD

This however did not resolve all of our issues. We were still seeing high iowait (repeated spikes up to 400ms) on three of our OSD nodes on all disks. We tried replacing the RAID controller (PERC H730) on these nodes and while this resolved the issue on one server the two others remained problematic. These two nodes were configured differently than the rest. They'd been configured in non-raid mode while the others were configured as individual raid-0. This turned out to be the problem. We ended up removing the two nodes one at a time and rebuilding them with their disks configured in independent raid-0 instead of non-raid. After this change iowait rarely spikes above 15ms and averages <1ms.

I was really surprised at the performance impact when using non-raid mode. While I realize non-raid bypasses the controller cache I still would have never expected such high latency. Dell has a whitepaper that recommends using individual raid-0 but their own tests show only a small performance advantage over non-raid. Note that we are running SAS disks, they actually recommend non-raid mode for SATA but I have not tested this. You can view the whitepaper here: http://en.community.dell.com/techcenter/cloud/m/dell_cloud_resources/20442913/download

I hope this helps someone.

John Petrini
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com