> -----Original Message-----
> From: ceph-users [mailto:ceph-users-bounces@xxxxxxxxxxxxxx] On Behalf Of William Josefsson
> Sent: 17 October 2016 09:31
> To: Christian Balzer <chibi@xxxxxxx>
> Cc: ceph-users@xxxxxxxxxxxxxx
> Subject: Re: RBD with SSD journals and SAS OSDs
>
> Thx Christian for helping troubleshoot the latency issues. I have attached my fio job template below.
>
> To eliminate the possibility that the VM itself is the bottleneck, I've created a 128GB, 32 vCPU flavor. Here's the latest fio benchmark:
> http://pastebin.ca/raw/3729693
> I'm trying to benchmark the cluster's performance for SYNCED WRITEs and how well suited it would be for disk-intensive workloads or DBs.
>
> > The size (45GB) of these journals is only going to be used by a little
> > fraction, unlikely to be more than 1GB in normal operations and with
> > default filestore/journal parameters.
>
> To consume more of the SSDs in the hope of achieving lower latency, can you pls advise which parameters I should be looking at? I have
> already tried what's mentioned in RaySun's ceph blog, which eventually lowered my overall sync write IOPS by 1-2k.

Your biggest gains will probably come from forcing the CPUs to max frequency and forcing the C-state to 1:

intel_idle.max_cstate=0 on the kernel command line, and

echo 100 > /sys/devices/system/cpu/intel_pstate/min_perf_pct

(I think this is the same as the performance governor.)

Use something like powertop to check that all cores are running at max frequency and are staying in C-state 1.

I have managed to get the latency on my cluster down to about 600us, but with your hardware I suspect you won't be able to get it below ~1-1.5ms best case.
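For reference, this is roughly what I do on each OSD node. A sketch only: it assumes the intel_pstate driver (check scaling_driver first); with acpi-cpufreq you would set the governor via cpupower instead, and the grub paths below are the CentOS 7 BIOS-boot defaults.

# confirm which frequency scaling driver is active
cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_driver

# pin the minimum P-state to 100% of max; roughly the same effect as
# the 'performance' governor
echo 100 > /sys/devices/system/cpu/intel_pstate/min_perf_pct

# the C-state limit is a boot parameter: add intel_idle.max_cstate=0 to
# GRUB_CMDLINE_LINUX in /etc/default/grub, then regenerate and reboot
grub2-mkconfig -o /boot/grub2/grub.cfg

# verify afterwards: all cores should sit at max frequency and spend
# their idle time in C0/C1 (powertop's Idle Stats tab shows this)
grep MHz /proc/cpuinfo
powertop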
> # These are from RaySun's write-up, and they worsen my total IOPS.
> # http://xiaoquqi.github.io/blog/2015/06/28/ceph-performance-optimization-summary/
>
> filestore xattr use omap = true
> filestore min sync interval = 10
> filestore max sync interval = 15
> filestore queue max ops = 25000
> filestore queue max bytes = 10485760
> filestore queue committing max ops = 5000
> filestore queue committing max bytes = 10485760000
> journal max write bytes = 1073714824
> journal max write entries = 10000
> journal queue max ops = 50000
> journal queue max bytes = 10485760000
>
> My journals are Intel S3610 200GB, split into 4-5 partitions each. When I ran fio on the disks locally with direct=1 and sync=1, the
> WRITE performance was 50k IOPS at 7 threads.
>
> My hardware specs:
>
> - 3 controllers, the mons run here:
>   Dell PE R630, 64GB, Intel S3610 SSD
> - 9 storage nodes:
>   Dell R730xd, 2x E5-2630v4 2.2GHz, 512GB, Journal: 5x 200GB Intel S3610 SSD,
>   OSD: 18x 1.8TB Hitachi 10krpm SAS
>
> The RAID controller is a PERC H730.
>
> All servers have 2x10GbE bonds (Intel X540 copper, ixgbe) connecting to Arista 7050X 10Gbit switches with VARP and LACP interfaces.
> I have pinged all hosts from my VM and the RTT is 0.3ms on the LAN. I did iperf, and I can do 10Gbps from the VM to the storage nodes.
>
> I've already been tuning: the CPU scaling governor is set to 'performance' on all hosts for all cores. My Ceph release is the latest
> hammer on CentOS 7.
>
> The best write currently happens at 62 threads it seems; the IOPS is 8.3k for the direct synced writes. The latency and stddev are
> still concerning.. :(
>
> simple-write-62: (groupid=14, jobs=62): err= 0: pid=2748: Mon Oct 17 15:20:05 2016
>   write: io=978.64MB, bw=33397KB/s, iops=8349, runt= 30006msec
>     clat (msec): min=3, max=20, avg= 7.42, stdev= 2.50
>      lat (msec): min=3, max=20, avg= 7.42, stdev= 2.50
>     clat percentiles (usec):
>      |  1.00th=[ 3888],  5.00th=[ 4256], 10.00th=[ 4448], 20.00th=[ 4768],
>      | 30.00th=[ 5088], 40.00th=[ 5984], 50.00th=[ 7904], 60.00th=[ 8384],
>      | 70.00th=[ 8768], 80.00th=[ 9408], 90.00th=[10432], 95.00th=[11584],
>      | 99.00th=[13760], 99.50th=[14784], 99.90th=[16320], 99.95th=[16512],
>      | 99.99th=[17792]
>     bw (KB /s): min=  315, max=  761, per=1.61%, avg=537.06, stdev=77.13
>     lat (msec) : 4=1.99%, 10=84.54%, 20=13.47%, 50=0.01%
>   cpu          : usr=0.05%, sys=0.35%, ctx=509542, majf=0, minf=1902
>   IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
>      submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>      complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>      issued    : total=r=0/w=250527/d=0, short=r=0/w=0/d=0
>
> From the above we can tell that the latency for clients doing synced writes is somewhere around 5-10ms, which seems very high,
> especially with quite high-performing hardware, network, and SSD journals. I'm not sure whether it is the syncing from journal to OSD
> that causes these fluctuations or high latencies.
>
> Any help or advice would be much appreciated. thx will
>
>
> [global]
> bs=4k
> rw=write
> sync=1
> direct=1
> iodepth=1
> filename=${FILE}
> runtime=30
> stonewall=1
> group_reporting
>
> [simple-write-6]
> numjobs=6
> [simple-write-10]
> numjobs=10
> [simple-write-14]
> numjobs=14
> [simple-write-18]
> numjobs=18
> [simple-write-22]
> numjobs=22
> [simple-write-26]
> numjobs=26
> [simple-write-30]
> numjobs=30
> [simple-write-34]
> numjobs=34
> [simple-write-38]
> numjobs=38
> [simple-write-42]
> numjobs=42
> [simple-write-46]
> numjobs=46
> [simple-write-50]
> numjobs=50
> [simple-write-54]
> numjobs=54
> [simple-write-58]
> numjobs=58
> [simple-write-62]
> numjobs=62
> [simple-write-66]
> numjobs=66
> [simple-write-70]
> numjobs=70
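Side note on your local S3610 numbers: for anyone wanting to reproduce that test, the usual journal-suitability check is direct, synced sequential writes straight to the device. A sketch of what I assume was run (device path illustrative, and it is destructive to anything on the target):

# direct=1 bypasses the page cache, sync=1 makes every write O_SYNC,
# which is roughly the write pattern the filestore journal produces
fio --name=journal-test --filename=/dev/sdX --direct=1 --sync=1 \
    --rw=write --bs=4k --numjobs=7 --iodepth=1 --runtime=60 \
    --time_based --group_reporting

Drives without power-loss protection tend to collapse to a few hundred IOPS on this test even when their cached write numbers look great, which is why the S3610/S3700 class is popular for journals.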
> On Mon, Oct 17, 2016 at 10:47 AM, Christian Balzer <chibi@xxxxxxx> wrote:
> >
> > Hello,
> >
> > On Sun, 16 Oct 2016 19:07:17 +0800 William Josefsson wrote:
> >
> >> Ok thanks for sharing. Yes, my journals are Intel S3610 200GB, which I
> >> partition into 4 partitions of ~45GB each. When I ceph-deploy I declare
> >> these as the journals of the OSDs.
> >>
> > The size (45GB) of these journals is only going to be used by a little
> > fraction, unlikely to be more than 1GB in normal operations and with
> > default filestore/journal parameters.
> >
> > Because those defaults start flushing things (from RAM, the journal
> > never gets read unless there is a crash) to the filestore (OSD HDD)
> > pretty much immediately.
> >
> > Again, use Google to search the ML archives.
> >
> >> I was trying to understand the blocking, and how much my SAS OSDs
> >> affect my performance. I have a total of 9 hosts, 158 OSDs each of
> >> 1.8TB. The servers are connected through copper 10Gbit LACP bonds.
> >> My failure domain is by type RACK. The CRUSH rule set is by rack, 3
> >> hosts in each rack. Pool size is =3. I'm running hammer on CentOS 7.
> >>
> > Which begs the question to fully detail your HW (CPUs, RAM), network
> > (topology, what switches, inter-rack/switch links), etc.
> > The reason for this will become obvious below.
> >
> >> I did a simple fio test from one of my xl instances, and got the
> >> results below. The latency of 7.21ms is worrying; is this an expected
> >> result? Or is there any way I can further tune my cluster to achieve
> >> better results? thx will
> >>
> >> FIO: sync=1, direct=1, bs=4k
> >>
> > Full command line, please.
> >
> > Small, sync I/Os are by far the hardest thing for Ceph.
> >
> > I can guess what some of the rest was, but it's better to know for sure.
> > Alternatively, additionally, try this please:
> >
> > "fio --size=1G --ioengine=libaio --invalidate=1 --direct=1
> > --numjobs=1 --rw=randwrite --name=fiojob --blocksize=4K --iodepth=32"
> >
> >> write-50: (groupid=11, jobs=50): err= 0: pid=3945: Sun Oct 16 08:41:15 2016
> >>   write: io=832092KB, bw=27721KB/s, iops=6930, runt= 30017msec
> >>     clat (msec): min=2, max=253, avg= 7.21, stdev= 4.97
> >>      lat (msec): min=2, max=253, avg= 7.21, stdev= 4.97
> >
> > These numbers suggest you did randwrite, and they aren't all that surprising.
> > If you were to run atop on your OSD nodes while doing that fio run,
> > you'll likely see that both the CPUs and the individual disks (HDDs) get very busy.
> >
> > There are several things conspiring against Ceph here: the latency of
> > its own code, the network latency of getting all the individual
> > writes to each replica, the fact that 1000 of these 4K blocks will hit
> > one typical RBD object (4MB) and thus one PG, making 3 OSDs very busy, etc.
> >
> > If you absolutely need low latencies with Ceph, consider dedicated
> > SSD-only pools for special-needs applications (DBs), or a cache tier if it
> > fits the profile and active working set.
> > Lower Ceph latency in general by having fast CPUs which have
> > power-saving (frequency throttling) disabled or set to "performance"
> > instead of "ondemand".
> >
> > Christian
> >
> >>     clat percentiles (msec):
> >>      |  1.00th=[    4],  5.00th=[    4], 10.00th=[    5], 20.00th=[    5],
> >>      | 30.00th=[    5], 40.00th=[    6], 50.00th=[    7], 60.00th=[    8],
> >>      | 70.00th=[    9], 80.00th=[   10], 90.00th=[   12], 95.00th=[   14],
> >>      | 99.00th=[   17], 99.50th=[   19], 99.90th=[   21], 99.95th=[   23],
> >>      | 99.99th=[  253]
> >>     bw (KB /s): min=  341, max=  870, per=2.01%, avg=556.60, stdev=136.98
> >>     lat (msec) : 4=8.24%, 10=74.10%, 20=17.52%, 50=0.12%, 500=0.02%
> >>   cpu          : usr=0.04%, sys=0.23%, ctx=425242, majf=0, minf=1570
> >>   IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
> >>      submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
> >>      complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
> >>      issued    : total=r=0/w=208023/d=0, short=r=0/w=0/d=0
> >>
> >> On Sun, Oct 16, 2016 at 4:18 PM, Christian Balzer <chibi@xxxxxxx> wrote:
> >> >
> >> > Hello,
> >> >
> >> > On Sun, 16 Oct 2016 15:03:24 +0800 William Josefsson wrote:
> >> >
> >> >> Hi list, while I know that writes in the RADOS backend are sync()'d,
> >> >> can anyone please explain when the cluster will return on a write
> >> >> call for RBD from VMs? Will data be considered synced once written
> >> >> to the journal, or only when it is all the way on the OSD drive?
> >> >>
> >> > This has been answered countless times (really) here; the Ceph
> >> > Architecture documentation should really be more detailed about
> >> > this, as well as about how the data is sent to the secondary OSDs in parallel.
> >> >
> >> > It is of course ack'ed to the client once all journals have
> >> > successfully written the data, otherwise journal SSDs would make a LOT less sense.
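(To put rough numbers on the flushing defaults Christian mentions further up: if I remember the hammer-era values correctly -- worth verifying against your running config -- the filestore starts syncing journaled data out to the HDDs between 10ms and 5s after it lands in the journal:

filestore min sync interval = 0.01
filestore max sync interval = 5

which is why a 45GB journal partition rarely fills beyond a GB or so. The live values can be checked on an OSD node through the admin socket, e.g.:

ceph daemon osd.0 config show | grep sync_interval

osd.0 here is just an example ID; use any OSD local to that node.)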
> >> >
> >> >> Each host in my cluster has 5x Intel S3610 and 18x 1.8TB Hitachi 10krpm SAS.
> >> >>
> >> > The size of your SSDs (which you didn't mention) will determine the
> >> > speed; for journal purposes the sequential write speed is basically it.
> >> >
> >> > A 5:18 ratio implies that some of your SSDs hold more journals than others.
> >> >
> >> > You emphatically do NOT want that, because eventually the busier
> >> > ones will run out of endurance while the other ones still have plenty left.
> >> >
> >> > If possible, change this to a 5:20 or 6:18 ratio (depending on your
> >> > SSDs and expected write volume).
> >> >
> >> > Christian
> >> >
> >> >> I have size=3 for my pool. Will Ceph return once the data is
> >> >> written to at least the 3 designated journals, or will it in fact wait
> >> >> until the data is written to the OSD drives? thx will
> >> >
> >> > --
> >> > Christian Balzer        Network/Systems Engineer
> >> > chibi@xxxxxxx           Global OnLine Japan/Rakuten Communications
> >> > http://www.gol.com/
> >
> > --
> > Christian Balzer          Network/Systems Engineer
> > chibi@xxxxxxx             Global OnLine Japan/Rakuten Communications
> > http://www.gol.com/

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com