Thanks Christian for helping troubleshoot the latency issues. I have attached my fio job template below. To rule out the VM as the bottleneck, I've created a 128GB, 32-vCPU flavor. Here's the latest fio benchmark: http://pastebin.ca/raw/3729693

I'm trying to benchmark the cluster's performance for SYNCED writes and how well suited it would be for disk-intensive workloads or DBs.

> The size (45GB) of these journals is only going to be used by a little
> fraction, unlikely to be more than 1GB in normal operations and with
> default filestore/journal parameters.

To consume more of the SSDs in the hope of achieving lower latency, can you please advise which parameters I should be looking at? I have already tried what's mentioned in RaySun's ceph blog, which actually lowered my overall sync write IOPS by 1-2k.

# These are from RaySun's write-up, and worsen my total IOPS.
# http://xiaoquqi.github.io/blog/2015/06/28/ceph-performance-optimization-summary/
filestore xattr use omap = true
filestore min sync interval = 10
filestore max sync interval = 15
filestore queue max ops = 25000
filestore queue max bytes = 10485760
filestore queue committing max ops = 5000
filestore queue committing max bytes = 10485760000
journal max write bytes = 1073714824
journal max write entries = 10000
journal queue max ops = 50000
journal queue max bytes = 10485760000

My journals are Intel S3610 200GB, split into 4-5 partitions each. When I ran fio on the disks locally with direct=1 and sync=1, the write performance was 50k IOPS at 7 threads.

My hardware specs:
- 3 controllers (the mons run here): Dell PE R630, 64GB, Intel SSD S3610
- 9 storage nodes: Dell R730xd, 2x E5-2630v4 2.2GHz, 512GB,
  Journal: 5x 200GB Intel S3610 SSD, OSD: 18x 1.8TB Hitachi 10krpm SAS
- RAID controller is PERC H730

All servers have 2x10GbE bonds (Intel ixgbe X540 copper) connecting to Arista 7050X 10Gbit switches with VARP and LACP interfaces.
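For completeness, this is the much smaller change I was planning to test next, instead of RaySun's full set. The values here are placeholders of my own, not recommendations; the idea is just to let the journal absorb more writes before flushing (the hammer defaults for the sync intervals are 0.01/5 seconds, as far as I can tell):

```
[osd]
# Placeholder values to test: let the filestore wait longer before
# flushing journaled writes to the OSD HDDs (defaults ~0.01 / 5 sec)
filestore min sync interval = 0.5
filestore max sync interval = 10
# Modest journal queue bump instead of RaySun's very large values
journal max write entries = 1000
journal queue max ops = 3000
```

If these sync intervals are the wrong knobs for latency (as opposed to throughput), please say so and I'll drop them.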
I have pinged all hosts from my VM and the RTT is 0.3ms on the LAN. I ran iperf, and I can do 10Gbps from the VM to the storage nodes. I've already been tuning: the CPU scaling governor is set to 'performance' on all hosts for all cores. My Ceph release is the latest hammer on CentOS 7.

The best write currently happens at 62 threads it seems; the IOPS is 8.3k for the direct synced writes. The latency and stddev are still concerning.. :(

simple-write-62: (groupid=14, jobs=62): err= 0: pid=2748: Mon Oct 17 15:20:05 2016
  write: io=978.64MB, bw=33397KB/s, iops=8349, runt= 30006msec
    clat (msec): min=3, max=20, avg= 7.42, stdev= 2.50
     lat (msec): min=3, max=20, avg= 7.42, stdev= 2.50
    clat percentiles (usec):
     |  1.00th=[ 3888],  5.00th=[ 4256], 10.00th=[ 4448], 20.00th=[ 4768],
     | 30.00th=[ 5088], 40.00th=[ 5984], 50.00th=[ 7904], 60.00th=[ 8384],
     | 70.00th=[ 8768], 80.00th=[ 9408], 90.00th=[10432], 95.00th=[11584],
     | 99.00th=[13760], 99.50th=[14784], 99.90th=[16320], 99.95th=[16512],
     | 99.99th=[17792]
    bw (KB /s): min=  315, max=  761, per=1.61%, avg=537.06, stdev=77.13
    lat (msec) : 4=1.99%, 10=84.54%, 20=13.47%, 50=0.01%
  cpu          : usr=0.05%, sys=0.35%, ctx=509542, majf=0, minf=1902
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued    : total=r=0/w=250527/d=0, short=r=0/w=0/d=0

From the above we can tell that the latency for clients doing synced writes is somewhere around 5-10ms, which seems very high, especially with quite high-performing hardware, network, and SSD journals. I'm not sure whether it may be the syncing from journal to OSD that causes these fluctuations or high latencies. Any help or advice would be much appreciated.
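As a back-of-the-envelope check (my own arithmetic, nothing authoritative): with sync=1 and iodepth=1 every job has exactly one write in flight, so total IOPS is bounded by numjobs divided by the average latency, and that matches the fio output almost exactly, i.e. the cluster is purely latency-bound:

```python
# With sync=1 and iodepth=1, each fio job completes one write per
# round trip, so throughput is simply numjobs / average latency.
numjobs = 62
avg_lat_s = 0.00742   # 7.42 ms average completion latency from fio above

expected_iops = numjobs / avg_lat_s
print(round(expected_iops))  # ~8356, vs. the 8349 that fio reported
```

So more threads is the only way I'm getting more IOPS out of this, and it does nothing for the per-write latency or the stddev.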
thx will

[global]
bs=4k
rw=write
sync=1
direct=1
iodepth=1
filename=${FILE}
runtime=30
stonewall=1
group_reporting

[simple-write-6]
numjobs=6
[simple-write-10]
numjobs=10
[simple-write-14]
numjobs=14
[simple-write-18]
numjobs=18
[simple-write-22]
numjobs=22
[simple-write-26]
numjobs=26
[simple-write-30]
numjobs=30
[simple-write-34]
numjobs=34
[simple-write-38]
numjobs=38
[simple-write-42]
numjobs=42
[simple-write-46]
numjobs=46
[simple-write-50]
numjobs=50
[simple-write-54]
numjobs=54
[simple-write-58]
numjobs=58
[simple-write-62]
numjobs=62
[simple-write-66]
numjobs=66
[simple-write-70]
numjobs=70

On Mon, Oct 17, 2016 at 10:47 AM, Christian Balzer <chibi@xxxxxxx> wrote:
>
> Hello,
>
> On Sun, 16 Oct 2016 19:07:17 +0800 William Josefsson wrote:
>
>> Ok thanks for sharing. Yes, my journals are Intel S3610 200GB, which I
>> partition in 4 partitions, each ~45GB. When I ceph-deploy I declare
>> these as the journals of the OSDs.
>>
> The size (45GB) of these journals is only going to be used by a little
> fraction, unlikely to be more than 1GB in normal operations and with
> default filestore/journal parameters.
>
> Because those defaults start flushing things (from RAM, the journal never
> gets read unless there is a crash) to the filestore (OSD HDD) pretty much
> immediately.
>
> Again, use google to search the ML archives.
>
>> I was trying to understand the blocking, and how much my SAS OSDs
>> affected my performance. I have a total of 9 hosts, 158 OSDs each
>> 1.8TB. The servers are connected through copper 10Gbit LACP bonds.
>> My failure domain is by type RACK. The CRUSH rule set is by rack, 3
>> hosts in each rack. Pool size is =3. I'm running hammer on centos7.
>>
>
> Which begs the question to fully detail your HW (CPUs, RAM), network
> (topology, what switches, inter-rack/switch links), etc.
> The reason for this will become obvious below.
>
>> I did a simple fio test from one of my xl instances, and got the
>> results below.
>> The latency of 7.21ms is worrying; are these expected
>> results? Or is there any way I can further tune my cluster to achieve
>> better results? thx will
>>
>
>> FIO: sync=1, direct=1, bs=4k
>>
> Full command line, please.
>
> Small, sync I/Os are by far the hardest thing for Ceph.
>
> I can guess what some of the rest was, but it's better to know for sure.
> Alternatively, additionally, try this please:
>
> "fio --size=1G --ioengine=libaio --invalidate=1 --direct=1 --numjobs=1
> --rw=randwrite --name=fiojob --blocksize=4K --iodepth=32"
>
>> write-50: (groupid=11, jobs=50): err= 0: pid=3945: Sun Oct 16 08:41:15 2016
>>   write: io=832092KB, bw=27721KB/s, iops=6930, runt= 30017msec
>>     clat (msec): min=2, max=253, avg= 7.21, stdev= 4.97
>>      lat (msec): min=2, max=253, avg= 7.21, stdev= 4.97
>
> These numbers suggest you did randwrite and aren't all that surprising.
> If you were to run atop on your OSD nodes while doing that fio run, you'll
> likely see that both CPUs and individual disks (HDDs) get very busy.
>
> There are several things conspiring against Ceph here: the latency of its
> own code, the network latency of getting all the individual writes to each
> replica, the fact that 1000 of these 4K blocks will hit one typical RBD
> object (4MB) and thus one PG, making 3 OSDs very busy, etc.
>
> If you absolutely need low latencies with Ceph, consider dedicated SSD-only
> pools for special-need applications (DB), or a cache tier if it fits
> the profile and active working set.
> Lower Ceph latency in general by having fast CPUs which have
> powersaving (frequency throttling) disabled or set to "performance"
> instead of "ondemand".
>
> Christian
>
>> clat percentiles (msec):
>>  |  1.00th=[    4],  5.00th=[    4], 10.00th=[    5], 20.00th=[    5],
>>  | 30.00th=[    5], 40.00th=[    6], 50.00th=[    7], 60.00th=[    8],
>>  | 70.00th=[    9], 80.00th=[   10], 90.00th=[   12], 95.00th=[   14],
>>  | 99.00th=[   17], 99.50th=[   19], 99.90th=[   21], 99.95th=[   23],
>>  | 99.99th=[  253]
>>  bw (KB /s): min=  341, max=  870, per=2.01%, avg=556.60, stdev=136.98
>>  lat (msec) : 4=8.24%, 10=74.10%, 20=17.52%, 50=0.12%, 500=0.02%
>>  cpu          : usr=0.04%, sys=0.23%, ctx=425242, majf=0, minf=1570
>>  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
>>     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>>     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>>     issued    : total=r=0/w=208023/d=0, short=r=0/w=0/d=0
>>
>> On Sun, Oct 16, 2016 at 4:18 PM, Christian Balzer <chibi@xxxxxxx> wrote:
>> >
>> > Hello,
>> >
>> > On Sun, 16 Oct 2016 15:03:24 +0800 William Josefsson wrote:
>> >
>> >> Hi list, while I know that writes in the RADOS backend are sync(), can
>> >> anyone please explain when the cluster will return on a write call for
>> >> RBD from VMs? Will data be considered synced once written to the
>> >> journal, or only all the way to the OSD drive?
>> >>
>> > This has been answered countless times (really) here; the Ceph Architecture
>> > documentation should really be more detailed about this, as well as how
>> > parallel the data is being sent to the secondary OSDs.
>> >
>> > It is of course ack'ed to the client once all journals have successfully
>> > written the data, otherwise journal SSDs would make a LOT less sense.
>> >
>> >> Each host in my cluster has 5x Intel S3610, and 18x 1.8TB Hitachi 10krpm SAS.
>> >>
>> > The size of your SSDs (you didn't mention) will determine the speed; for
>> > journal purposes the sequential write speed is basically it.
>> >
>> > A 5:18 ratio implies that some of your SSDs hold more journals than others.
>> >
>> > You emphatically do NOT want that, because eventually the busier ones will
>> > run out of endurance while the other ones still have plenty left.
>> >
>> > If possible change this to a 5:20 or 6:18 ratio (depending on your SSDs
>> > and expected write volume).
>> >
>> > Christian
>> >
>> >> I have size=3 for my pool. Will Ceph return once the data is written
>> >> to at least 3 designated journals, or will it in fact wait until the
>> >> data is written to the OSD drives? thx will
>> >> _______________________________________________
>> >> ceph-users mailing list
>> >> ceph-users@xxxxxxxxxxxxxx
>> >> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>> >>
>> >
>> > --
>> > Christian Balzer           Network/Systems Engineer
>> > chibi@xxxxxxx              Global OnLine Japan/Rakuten Communications
>> > http://www.gol.com/
>
> --
> Christian Balzer           Network/Systems Engineer
> chibi@xxxxxxx              Global OnLine Japan/Rakuten Communications
> http://www.gol.com/
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com