hi nick, I earlier ran cpupower frequency-set --governor performance on all my
hosts, which bumped all CPUs up to close to max speed. It didn't really help
much, and I still see 5-10ms latency in my fio benchmarks in VMs with the job
description below. Is there anything else I can do to push the SSDs harder? I
know direct synced writes may not be the most common application case, but I
need to improve this worst case. Benchmarking these SSDs locally with fio and
direct sync writes, they can do 40-50k IOPS. I'm not sure exactly what, but
something is holding back the max performance. I know the journals are sparsely
used from the collectd graphs. appreciate any advice. thx will

>> [global]
>> bs=4k
>> rw=write
>> sync=1
>> direct=1
>> iodepth=1
>> filename=/dev/vdb1
>> runtime=30
>> stonewall=1
>> group_reporting

grep "cpu MHz" /proc/cpuinfo
cpu MHz : 2945.250
cpu MHz : 2617.500
cpu MHz : 3065.062
cpu MHz : 2574.281
cpu MHz : 2739.468
cpu MHz : 2857.593
cpu MHz : 2602.125
cpu MHz : 2581.687
cpu MHz : 2958.656
cpu MHz : 2793.093
cpu MHz : 2682.750
cpu MHz : 2699.718
cpu MHz : 2620.125
cpu MHz : 2926.875
cpu MHz : 2740.031
cpu MHz : 2559.656
cpu MHz : 2758.875
cpu MHz : 2656.593
cpu MHz : 1476.187
cpu MHz : 2545.125
cpu MHz : 2792.718
cpu MHz : 2630.156
cpu MHz : 3090.750
cpu MHz : 2951.906
cpu MHz : 2845.875
cpu MHz : 2553.281
cpu MHz : 2602.125
cpu MHz : 2600.906
cpu MHz : 2737.031
cpu MHz : 2552.156
cpu MHz : 2624.625
cpu MHz : 2614.125

On Mon, Oct 17, 2016 at 5:17 PM, Nick Fisk <nick@xxxxxxxxxx> wrote:
>> -----Original Message-----
>> From: ceph-users [mailto:ceph-users-bounces@xxxxxxxxxxxxxx] On Behalf Of William Josefsson
>> Sent: 17 October 2016 09:31
>> To: Christian Balzer <chibi@xxxxxxx>
>> Cc: ceph-users@xxxxxxxxxxxxxx
>> Subject: Re: RBD with SSD journals and SAS OSDs
>>
>> Thx Christian for helping troubleshoot the latency issues. I have attached my fio job template below.
>>
>> To eliminate the VM as the bottleneck, I've created a 128GB, 32 vCPU flavor. Here's the latest fio benchmark:
>> http://pastebin.ca/raw/3729693 I'm trying to benchmark the cluster's performance for synced writes and how
>> well suited it would be for disk intensive workloads or DBs.
>>
>> > The size (45GB) of these journals is only going to be used by a little
>> > fraction, unlikely to be more than 1GB in normal operations and with
>> > default filestore/journal parameters.
>>
>> To consume more of the SSDs in the hope of achieving lower latency, can you pls advise what parameters I should
>> be looking at? I have already tried what's mentioned in RaySun's ceph blog, which eventually lowered my overall
>> sync write IOPS performance by 1-2k.
>
> Your biggest gains will probably be around forcing the CPUs to max frequency and forcing the c-state to 1.
>
> intel_idle.max_cstate=0 on the kernel parameters
> and
> echo 100 > /sys/devices/system/cpu/intel_pstate/min_perf_pct (I think this is the same as the performance governor)
>
> Use something like powertop to check that all cores are running at max freq and are staying in C-state 1.
>
> I have managed to get the latency on my cluster down to about 600us, but with your hardware I don't suspect you
> would be able to get it below ~1-1.5ms best case.
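A quick way to double-check that the governor and C-state settings actually took effect on every host (note that one
core in the /proc/cpuinfo output above is still reporting ~1476 MHz, so per-core verification is worthwhile). This is
only a sketch, assuming a Linux host with the usual sysfs cpufreq/cpuidle interfaces and turbostat (kernel-tools) or
powertop installed:

    # confirm the kernel parameter actually made it onto the boot command line
    cat /proc/cmdline

    # governor and current frequency per core
    grep . /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
    grep . /sys/devices/system/cpu/cpu*/cpufreq/scaling_cur_freq

    # which C-states the idle driver exposes, and how often they are entered
    grep . /sys/devices/system/cpu/cpu0/cpuidle/state*/name
    grep . /sys/devices/system/cpu/cpu0/cpuidle/state*/usage

    # live per-core frequency and C-state residency while fio is running
    turbostat        # or: powertop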
>> # These are from RaySun's write up, and worsen my total IOPS.
>> # http://xiaoquqi.github.io/blog/2015/06/28/ceph-performance-optimization-summary/
>>
>> filestore xattr use omap = true
>> filestore min sync interval = 10
>> filestore max sync interval = 15
>> filestore queue max ops = 25000
>> filestore queue max bytes = 10485760
>> filestore queue committing max ops = 5000
>> filestore queue committing max bytes = 10485760000
>> journal max write bytes = 1073714824
>> journal max write entries = 10000
>> journal queue max ops = 50000
>> journal queue max bytes = 10485760000
>>
>> My journals are Intel S3610 200GB, split into 4-5 partitions each. When I did fio on the disks locally with
>> direct=1 and sync=1, the write performance was 50k IOPS at 7 threads.
>>
>> My hardware specs:
>>
>> - 3 controllers (the mons run here):
>>   Dell PE R630, 64GB, Intel SSD S3610
>> - 9 storage nodes:
>>   Dell R730xd, 2x E5-2630 v4 2.2GHz, 512GB, Journal: 5x 200GB Intel S3610 SSD,
>>   OSD: 18x 1.8TB Hitachi 10krpm SAS
>>
>> The RAID controller is PERC 730.
>>
>> All servers have 2x10GbE bonds, Intel ixgbe X540 copper, connecting to Arista 7050X 10Gbit switches with VARP and
>> LACP interfaces. I have pinged all hosts from my VM and the RTT is 0.3ms on the LAN. I did iperf, and I can do
>> 10Gbps from the VM to the storage nodes.
>>
>> I've already been tuning: the CPU scaling governor is set to 'performance' on all hosts for all cores. My Ceph
>> release is the latest hammer on CentOS 7.
>>
>> The best write performance currently happens at 62 threads it seems; the IOPS is 8.3k for the direct synced
>> writes. The latency and stddev are still concerning.. :(
>>
>> simple-write-62: (groupid=14, jobs=62): err= 0: pid=2748: Mon Oct 17 15:20:05 2016
>>   write: io=978.64MB, bw=33397KB/s, iops=8349, runt= 30006msec
>>     clat (msec): min=3, max=20, avg= 7.42, stdev= 2.50
>>      lat (msec): min=3, max=20, avg= 7.42, stdev= 2.50
>>     clat percentiles (usec):
>>      |  1.00th=[ 3888],  5.00th=[ 4256], 10.00th=[ 4448], 20.00th=[ 4768],
>>      | 30.00th=[ 5088], 40.00th=[ 5984], 50.00th=[ 7904], 60.00th=[ 8384],
>>      | 70.00th=[ 8768], 80.00th=[ 9408], 90.00th=[10432], 95.00th=[11584],
>>      | 99.00th=[13760], 99.50th=[14784], 99.90th=[16320], 99.95th=[16512],
>>      | 99.99th=[17792]
>>     bw (KB /s): min= 315, max= 761, per=1.61%, avg=537.06, stdev=77.13
>>     lat (msec) : 4=1.99%, 10=84.54%, 20=13.47%, 50=0.01%
>>   cpu          : usr=0.05%, sys=0.35%, ctx=509542, majf=0, minf=1902
>>   IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
>>      submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>>      complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>>      issued    : total=r=0/w=250527/d=0, short=r=0/w=0/d=0
>>
>> From the above we can tell that the latency for clients doing synced writes is somewhere around 5-10ms, which
>> seems very high, especially with quite high-performing hardware, network, and SSD journals. I'm not sure whether
>> it may be the syncing from journal to OSD that causes these fluctuations or high latencies.
>>
>> Any help or advice would be much appreciated.
>> thx will
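As a cross-check on the numbers above: with iodepth=1 and sync=1 every job has exactly one write in flight, so the
aggregate IOPS is bounded by the number of jobs divided by the average commit latency:

    62 jobs x (1 / 7.42 ms) = 62 x ~135 writes/s ~= 8,350 IOPS

which matches the reported iops=8349 almost exactly. In other words, this workload is latency-bound per request rather
than throughput-bound, so it is the per-write commit path (CPU frequency and C-states, network round trips, journal
media) that sets the ceiling here, not queue or journal sizing.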
>>
>> [global]
>> bs=4k
>> rw=write
>> sync=1
>> direct=1
>> iodepth=1
>> filename=${FILE}
>> runtime=30
>> stonewall=1
>> group_reporting
>>
>> [simple-write-6]
>> numjobs=6
>> [simple-write-10]
>> numjobs=10
>> [simple-write-14]
>> numjobs=14
>> [simple-write-18]
>> numjobs=18
>> [simple-write-22]
>> numjobs=22
>> [simple-write-26]
>> numjobs=26
>> [simple-write-30]
>> numjobs=30
>> [simple-write-34]
>> numjobs=34
>> [simple-write-38]
>> numjobs=38
>> [simple-write-42]
>> numjobs=42
>> [simple-write-46]
>> numjobs=46
>> [simple-write-50]
>> numjobs=50
>> [simple-write-54]
>> numjobs=54
>> [simple-write-58]
>> numjobs=58
>> [simple-write-62]
>> numjobs=62
>> [simple-write-66]
>> numjobs=66
>> [simple-write-70]
>> numjobs=70
>>
>> On Mon, Oct 17, 2016 at 10:47 AM, Christian Balzer <chibi@xxxxxxx> wrote:
>> >
>> > Hello,
>> >
>> > On Sun, 16 Oct 2016 19:07:17 +0800 William Josefsson wrote:
>> >
>> >> Ok thanks for sharing. yes my journals are Intel S3610 200GB, which I
>> >> partition into 4 partitions of ~45GB each. When I ceph-deploy I declare
>> >> these as the journals of the OSDs.
>> >>
>> > The size (45GB) of these journals is only going to be used by a little
>> > fraction, unlikely to be more than 1GB in normal operations and with
>> > default filestore/journal parameters.
>> >
>> > Because those defaults start flushing things (from RAM, the journal
>> > never gets read unless there is a crash) to the filestore (OSD HDD)
>> > pretty much immediately.
>> >
>> > Again, use google to search the ML archives.
>> >
>> >> I was trying to understand the blocking, and how much my SAS OSDs
>> >> affect my performance. I have a total of 9 hosts and 158 OSDs, each
>> >> 1.8TB. The servers are connected through copper 10Gbit LACP bonds.
>> >> My failure domain is by type RACK. The CRUSH rule set is by rack, 3
>> >> hosts in each rack. Pool size is 3. I'm running hammer on CentOS 7.
>> >>
>> > Which begs the question to fully detail your HW (CPUs, RAM), network
>> > (topology, what switches, inter-rack/switch links), etc.
>> > The reason for this will become obvious below.
>> >
>> >> I did a simple fio test from one of my xl instances, and got the
>> >> results below. The latency of 7.21ms is worrying; is this an expected
>> >> result? Or is there any way I can further tune my cluster to achieve
>> >> better results? thx will
>> >>
>> >> FIO: sync=1, direct=1, bs=4k
>> >>
>> > Full command line, please.
>> >
>> > Small, sync I/Os are by far the hardest thing for Ceph.
>> >
>> > I can guess what some of the rest was, but it's better to know for sure.
>> > Alternatively, additionally, try this please:
>> >
>> > "fio --size=1G --ioengine=libaio --invalidate=1 --direct=1
>> > --numjobs=1 --rw=randwrite --name=fiojob --blocksize=4K --iodepth=32"
>> >
>> >> write-50: (groupid=11, jobs=50): err= 0: pid=3945: Sun Oct 16 08:41:15 2016
>> >>   write: io=832092KB, bw=27721KB/s, iops=6930, runt= 30017msec
>> >>     clat (msec): min=2, max=253, avg= 7.21, stdev= 4.97
>> >>      lat (msec): min=2, max=253, avg= 7.21, stdev= 4.97
>> >
>> > These numbers suggest you did randwrite and aren't all that surprising.
>> > If you were to run atop on your OSD nodes while doing that fio run,
>> > you'll likely see that both the CPUs and individual disks (HDDs) get very busy.
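For watching the OSD nodes during such a run, a minimal sketch (assuming the sysstat and atop packages are installed
on the storage nodes; the exact device names of the OSD HDDs will vary):

    # extended per-device statistics every 2 seconds: watch %util, await and queue size
    iostat -xm 2

    # or interactively; atop highlights saturated disks and busy ceph-osd processes
    atop 2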
>> >
>> > There are several things conspiring against Ceph here: the latency of
>> > its own code, the network latency of getting all the individual
>> > writes to each replica, and the fact that 1000 of these 4K blocks will hit
>> > one typical RBD object (4MB) and thus one PG, making 3 OSDs very busy, etc.
>> >
>> > If you absolutely need low latencies with Ceph, consider dedicated SSD-only
>> > pools for special-need applications (DB) or a cache tier if it
>> > fits the profile and active working set.
>> > Lower Ceph latency in general by having fast CPUs which have
>> > powersaving (frequency throttling) disabled or set to "performance"
>> > instead of "ondemand".
>> >
>> > Christian
>> >
>> >>     clat percentiles (msec):
>> >>      |  1.00th=[    4],  5.00th=[    4], 10.00th=[    5], 20.00th=[    5],
>> >>      | 30.00th=[    5], 40.00th=[    6], 50.00th=[    7], 60.00th=[    8],
>> >>      | 70.00th=[    9], 80.00th=[   10], 90.00th=[   12], 95.00th=[   14],
>> >>      | 99.00th=[   17], 99.50th=[   19], 99.90th=[   21], 99.95th=[   23],
>> >>      | 99.99th=[  253]
>> >>     bw (KB /s): min= 341, max= 870, per=2.01%, avg=556.60, stdev=136.98
>> >>     lat (msec) : 4=8.24%, 10=74.10%, 20=17.52%, 50=0.12%, 500=0.02%
>> >>   cpu          : usr=0.04%, sys=0.23%, ctx=425242, majf=0, minf=1570
>> >>   IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
>> >>      submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>> >>      complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>> >>      issued    : total=r=0/w=208023/d=0, short=r=0/w=0/d=0
>> >>
>> >> On Sun, Oct 16, 2016 at 4:18 PM, Christian Balzer <chibi@xxxxxxx> wrote:
>> >> >
>> >> > Hello,
>> >> >
>> >> > On Sun, 16 Oct 2016 15:03:24 +0800 William Josefsson wrote:
>> >> >
>> >> >> Hi list, while I know that writes in the RADOS backend are sync(),
>> >> >> can anyone please explain when the cluster will return on a write
>> >> >> call for RBD from VMs? Will data be considered synced once written
>> >> >> to the journal, or only once it is all the way on the OSD drive?
>> >> >>
>> >> > This has been answered countless times (really) here; the Ceph
>> >> > Architecture documentation should really be more detailed about
>> >> > this, as well as about how the data is sent to the secondary OSDs in parallel.
>> >> >
>> >> > It is of course ack'ed to the client once all journals have
>> >> > successfully written the data, otherwise journal SSDs would make a LOT less sense.
>> >> >
>> >> >> Each host in my cluster has 5x Intel S3610, and 18x 1.8TB Hitachi 10krpm SAS.
>> >> >>
>> >> > The size of your SSDs (you didn't mention it) will determine the
>> >> > speed; for journal purposes the sequential write speed is basically it.
>> >> >
>> >> > A 5:18 ratio implies that some of your SSDs hold more journals than others.
>> >> >
>> >> > You emphatically do NOT want that, because eventually the busier
>> >> > ones will run out of endurance while the other ones still have plenty left.
>> >> >
>> >> > If possible change this to a 5:20 or 6:18 ratio (depending on your
>> >> > SSDs and expected write volume).
>> >> >
>> >> > Christian
>> >> >
>> >> >> I have size=3 for my pool. Will Ceph return once the data is
>> >> >> written to at least 3 designated journals, or will it in fact wait
>> >> >> until the data is written to the OSD drives?
>> >> >> thx will
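On the journal point above, one common way to qualify an SSD as a journal device is to measure its O_DSYNC 4k write
behaviour directly, since that is essentially the journal's write pattern. A minimal sketch, assuming the journal SSD
(or a spare partition on it) is /dev/sdX and that it can be safely overwritten:

    # /dev/sdX below is a placeholder for the journal SSD; this write test destroys data on it
    fio --name=journal-test --filename=/dev/sdX --direct=1 --sync=1 \
        --rw=write --bs=4k --numjobs=1 --iodepth=1 --runtime=60 \
        --time_based --group_reporting

A journal SSD that cannot sustain these writes at well under 1 ms apiece will put a floor under client sync-write
latency no matter how the rest of the cluster is tuned.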
>> >> > --
>> >> > Christian Balzer           Network/Systems Engineer
>> >> > chibi@xxxxxxx              Global OnLine Japan/Rakuten Communications
>> >> > http://www.gol.com/
>
> _______________________________________________
> ceph-users mailing list
> ceph-users@xxxxxxxxxxxxxx
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com