Hello,

Note that the tests below were done on a VM with RBD cache disabled, so
the "direct=1" flag in FIO had a similar impact to "sync=1".
If your databases are MySQL, Oracle or something else that can use
O_DIRECT, RBD caching can improve things dramatically for you (with the
same risks that on-disk caches pose).

Compare this FIO run on the non-cached VM:
(fio --size=1G --ioengine=libaio --invalidate=1 --direct=1 --numjobs=1 --rw=write --name=fiojob --blocksize=4K --iodepth=1)
---
  write: io=1024.0MB, bw=2785.9KB/s, iops=696, runt=376395msec
    slat (usec): min=5, max=307, avg=15.58, stdev= 7.09
    clat (usec): min=930, max=26628, avg=1417.07, stdev=251.83
     lat (usec): min=944, max=26654, avg=1432.96, stdev=253.30
---

to the same one on a VM with default RBD caching:
---
  write: io=1024.0MB, bw=118269KB/s, iops=29567, runt= 8866msec
    slat (usec): min=2, max=241, avg= 3.93, stdev= 1.32
    clat (usec): min=0, max=24091, avg=29.29, stdev=59.01
     lat (usec): min=29, max=24095, avg=33.34, stdev=59.04
---

Christian

On Tue, 18 Oct 2016 23:55:31 +0800 William Josefsson wrote:

> Thx Christian for elaborating on this, appreciate it. I will rerun some
> of my benchmarks and take your advice into consideration. I have also
> found maximum performance recommendations for the Dell 730xd BIOS
> settings, hope these make sense: http://pasteboard.co/guHVMQVly.jpg
> I will set all these settings, and intel_idle.max_cstate=0 as
> suggested by Nick, and rerun the fio benchmarks. thx will
>
>
> On Tue, Oct 18, 2016 at 9:44 AM, Christian Balzer <chibi@xxxxxxx> wrote:
> >
> > Hello,
> >
> > As I had this written mostly already and since it covers some points Nick
> > raised in more detail, here we go.
> >
> > On Mon, 17 Oct 2016 16:30:48 +0800 William Josefsson wrote:
> >
> >> Thx Christian for helping troubleshoot the latency issues. I have
> >> attached my fio job template below.
> >>
> > There's no trouble here per se, just facts of life (Ceph).
> >
> > You'll be well advised to search the ML, especially for what Nick Fisk
> > had to write about these things (several times).
> >
> >> To eliminate the possibility that the VM is the bottleneck, I've
> >> created a 128GB, 32 vCPU flavor.
> > Nope, the client is not the issue.
> >
> >> Here's the latest fio benchmark:
> >> http://pastebin.ca/raw/3729693 I'm trying to benchmark the cluster's
> >> performance for SYNCED WRITEs and how well suited it would be for
> >> disk-intensive workloads or DBs.
> >>
> >
> > A single IOPS of that type and size will only hit the journal and be
> > ACK'ed quickly (well, quicker than what you see now), but FIO is creating
> > a constant stream of requests, eventually hitting the actual OSD as well.
> >
> > Aside from CPU load, of course.
> >
> >>
> >> > The size (45GB) of these journals is only going to be used by a little
> >> > fraction, unlikely to be more than 1GB in normal operations and with
> >> > default filestore/journal parameters.
> >>
> >> To consume more of the SSDs in the hope of achieving lower latency, can
> >> you pls advise what parameters I should be looking at?
> >
> > Not going to help with your prolonged FIO runs; once the flushing to the
> > OSDs commences, stalls will ensue.
> > The moment the journal is full or the timers kick in, things will go down
> > to OSD (HDD) speed.
> > The journal is there to help with small, short bursts.
> >
> >> I have already
> >> tried what's mentioned in RaySun's ceph blog, which eventually
> >> lowered my overall sync write IOPS performance by 1-2k.
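As an aside, before the settings below: if you want to confirm what a
running OSD actually uses for any of these parameters, the admin socket
is the easiest way; something along these lines, with osd.0 purely as an
example:
---
ceph daemon osd.0 config get filestore_min_sync_interval
ceph daemon osd.0 config get filestore_max_sync_interval
ceph daemon osd.0 config show | grep -E 'filestore|journal'
---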
> >>
> > Unsurprisingly, the default values are there for a reason.
> >
> >> # These are from RaySun's write-up, and worsen my total IOPS.
> >> # http://xiaoquqi.github.io/blog/2015/06/28/ceph-performance-optimization-summary/
> >>
> >> filestore xattr use omap = true
> >> filestore min sync interval = 10
> > Way too high, 0.5 is probably already excessive, I run with 0.1.
> >
> >> filestore max sync interval = 15
> >
> >> filestore queue max ops = 25000
> >> filestore queue max bytes = 10485760
> >> filestore queue committing max ops = 5000
> >> filestore queue committing max bytes = 10485760000
> > Your HDDs will choke on those 4. With a 10k SAS HDD a small increase over
> > the defaults may help.
> >
> >> journal max write bytes = 1073714824
> >> journal max write entries = 10000
> >> journal queue max ops = 50000
> >> journal queue max bytes = 10485760000
> >>
> >> My journals are Intel S3610 200GB, split into 4-5 partitions each.
> > Again, you want to even that out.
> >
> >> When
> >> I did FIO on the disks locally with direct=1 and sync=1, the WRITE
> >> performance was 50k IOPS for 7 threads.
> >>
> > Yes, but as I wrote, that's not how journals work; think more of 7
> > sequential writes, not rand-writes.
> >
> > And as I tried to explain before, the SSDs are not the bottleneck, your
> > CPUs may be and your OSD HDDs eventually will be.
> > Run atop on all your nodes when doing those tests and see how much things
> > get pushed (CPUs, disks, the OSD processes).
> >
> >> My hardware specs:
> >>
> >> - 3 Controllers, the mons run here
> >>   Dell PE R630, 64GB, Intel SSD S3610
> >> - 9 Storage nodes
> >>   Dell 730xd, 2x2630v4 2.2GHz, 512GB, Journal: 5x200GB Intel S3610 SSD,
> >>   OSD: 18x1.8TB Hitachi 10krpm SAS
> >>
> > I can't really fault you for the choice of CPU, but smaller nodes with
> > higher speed and fewer cores may help with this extreme test case (in
> > normal production you're fine).
> >
> >> RAID Controller is PERC 730
> >>
> >> All servers have 2x10GbE bonds, Intel ixgbe X540 copper connecting to
> >> Arista 7050X 10Gbit switches with VARP and LACP interfaces. I have
> >> pinged all hosts from my VM and the RTT is 0.3ms on the LAN. I did
> >> iperf, and I can do 10Gbps from the VM to the storage nodes.
> >>
> > Bandwidth is irrelevant in this case, but the RTT of 0.3ms feels a bit
> > high. If you look again at the flow in
> > http://docs.ceph.com/docs/hammer/architecture/#smart-daemons-enable-hyperscale
> >
> > those round trips will add up to a significant part of your Ceph latency.
> >
> > To elaborate and demonstrate:
> >
> > I have a test cluster consisting of 4 nodes, 2 of them HDD-backed OSDs with
> > SSD journals and 2 of them SSD-based (4x DC S3610 400GB each) acting as a
> > cache tier for the "normal" ones. All replication 2.
> > So for the purpose of this test, this is all 100% against the SSDs in the
> > cache pool only.
> >
> > The network is IPoIB (QDR, 40Gb/s InfiniBand) with 0.1ms latency between
> > nodes, CPU is a single E5-2620 v3.
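For comparison with the 0.1ms and 0.3ms figures, node-to-node RTT is easy
enough to sample with plain ping and a look at the avg/mdev it reports,
roughly:
---
ping -c 100 -i 0.2 <one-of-your-osd-nodes>
---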
> >
> > If I run this from a VM:
> > ---
> > fio --size=1G --ioengine=libaio --invalidate=1 --direct=1 --numjobs=1 --rw=randwrite --name=fiojob --blocksize=4K --iodepth=64
> > ---
> >
> > We wind up with:
> > ---
> >   write: io=1024.0MB, bw=34172KB/s, iops=8543, runt= 30685msec
> >     slat (usec): min=1, max=2874, avg= 4.66, stdev= 7.07
> >     clat (msec): min=1, max=66, avg= 7.49, stdev= 7.80
> >      lat (msec): min=1, max=66, avg= 7.49, stdev= 7.80
> > ---
> > During this run the CPU is the bottleneck, idle is around 60% (of 1200),
> > and all 4 OSD processes eat up nearly 3 CPU "cores".
> > As I said, small random IOPS are the most stressful thing for Ceph.
> > CPU performance settings influence this little/not at all, as everything
> > goes to full speed in less than a second and stays there.
> >
> >
> > If we change the FIO invocation to plain sequential "--rw=write", the CPU
> > usage is less than 250% (out of 1200) and things are pretty relaxed.
> > At that point we're basically pushing the edge of latency in all
> > components involved:
> > ---
> >   write: io=1024.0MB, bw=37819KB/s, iops=9454, runt= 27726msec
> >     slat (usec): min=1, max=3834, avg= 3.77, stdev= 8.42
> >     clat (usec): min=943, max=38129, avg=6764.11, stdev=3262.91
> >      lat (usec): min=954, max=38135, avg=6768.04, stdev=3263.55
> > ---
> >
> > If we then lower this to just one thread with "--iodepth=1", to see
> > how fast things could potentially be when we don't saturate everything:
> > ---
> >     slat (usec): min=12, max=100, avg=21.43, stdev= 7.96
> >     clat (usec): min=1725, max=5873, avg=2485.46, stdev=256.97
> >      lat (usec): min=1744, max=5894, avg=2507.35, stdev=257.11
> > ---
> >
> > So 2.5ms instead of 7ms. Not too shabby.
> >
> >
> > Now if we do the same run but with the CPU governors set to performance,
> > we get:
> > ---
> >     slat (usec): min=6, max=291, avg=17.34, stdev= 8.00
> >     clat (usec): min=957, max=13754, avg=1425.83, stdev=262.85
> >      lat (usec): min=968, max=13766, avg=1443.56, stdev=264.54
> > ---
> >
> > So that's where the CPU tuning comes in.
> > And this is, in real life where you hopefully don't have thousands of
> > small sync I/Os at the same time, a pretty decent result.
> >
> >
> >> I've already been tuning: CPU scaling governor set to 'performance' on
> >> all hosts for all cores. My Ceph release is the latest Hammer on CentOS 7.
> >>
> > Jewel is also supposed to have many improvements in this area, but frankly
> > I haven't been brave (convinced) enough to upgrade from Hammer yet.
> >
> > Christian
> >
> >> The best write currently happens at 62 threads it seems; the IOPS is
> >> 8.3k for the direct synced writes. The latency and stddev are still
> >> concerning.. :(
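On the governor point above, since it makes such a difference here: this
is just the cpufreq setting; a rough sketch, assuming the usual sysfs
interface is present:
---
# check the current governor on all cores
cat /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
# switch everything to performance (as root)
for g in /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor; do
    echo performance > $g
done
---
On CentOS 7, "cpupower frequency-set -g performance" should do the same.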
> >>
> >> simple-write-62: (groupid=14, jobs=62): err= 0: pid=2748: Mon Oct 17
> >> 15:20:05 2016
> >>   write: io=978.64MB, bw=33397KB/s, iops=8349, runt= 30006msec
> >>     clat (msec): min=3, max=20, avg= 7.42, stdev= 2.50
> >>      lat (msec): min=3, max=20, avg= 7.42, stdev= 2.50
> >>     clat percentiles (usec):
> >>      |  1.00th=[ 3888],  5.00th=[ 4256], 10.00th=[ 4448], 20.00th=[ 4768],
> >>      | 30.00th=[ 5088], 40.00th=[ 5984], 50.00th=[ 7904], 60.00th=[ 8384],
> >>      | 70.00th=[ 8768], 80.00th=[ 9408], 90.00th=[10432], 95.00th=[11584],
> >>      | 99.00th=[13760], 99.50th=[14784], 99.90th=[16320], 99.95th=[16512],
> >>      | 99.99th=[17792]
> >>     bw (KB /s): min= 315, max= 761, per=1.61%, avg=537.06, stdev=77.13
> >>     lat (msec) : 4=1.99%, 10=84.54%, 20=13.47%, 50=0.01%
> >>   cpu          : usr=0.05%, sys=0.35%, ctx=509542, majf=0, minf=1902
> >>   IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
> >>      submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
> >>      complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
> >>      issued    : total=r=0/w=250527/d=0, short=r=0/w=0/d=0
> >>
> >>
> >> From the above we can tell that the latency for clients doing synced
> >> writes is somewhere around 5-10ms, which seems very high, especially
> >> with quite high-performing hardware, network, and SSD journals. I'm not
> >> sure whether it may be the syncing from journal to OSD that causes
> >> these fluctuations or high latencies.
> >>
> >> Any help or advice would be much appreciated. thx will
> >>
> >>
> >> [global]
> >> bs=4k
> >> rw=write
> >> sync=1
> >> direct=1
> >> iodepth=1
> >> filename=${FILE}
> >> runtime=30
> >> stonewall=1
> >> group_reporting
> >>
> >> [simple-write-6]
> >> numjobs=6
> >> [simple-write-10]
> >> numjobs=10
> >> [simple-write-14]
> >> numjobs=14
> >> [simple-write-18]
> >> numjobs=18
> >> [simple-write-22]
> >> numjobs=22
> >> [simple-write-26]
> >> numjobs=26
> >> [simple-write-30]
> >> numjobs=30
> >> [simple-write-34]
> >> numjobs=34
> >> [simple-write-38]
> >> numjobs=38
> >> [simple-write-42]
> >> numjobs=42
> >> [simple-write-46]
> >> numjobs=46
> >> [simple-write-50]
> >> numjobs=50
> >> [simple-write-54]
> >> numjobs=54
> >> [simple-write-58]
> >> numjobs=58
> >> [simple-write-62]
> >> numjobs=62
> >> [simple-write-66]
> >> numjobs=66
> >> [simple-write-70]
> >> numjobs=70
> >>
> >> On Mon, Oct 17, 2016 at 10:47 AM, Christian Balzer <chibi@xxxxxxx> wrote:
> >> >
> >> > Hello,
> >> >
> >> >
> >> > On Sun, 16 Oct 2016 19:07:17 +0800 William Josefsson wrote:
> >> >
> >> >> Ok, thanks for sharing. Yes, my journals are Intel S3610 200GB, which
> >> >> I split into 4 partitions of ~45GB each. When I ceph-deploy I declare
> >> >> these as the journals of the OSDs.
> >> >>
> >> > The size (45GB) of these journals is only going to be used by a little
> >> > fraction, unlikely to be more than 1GB in normal operations and with
> >> > default filestore/journal parameters.
> >> >
> >> > Because those defaults start flushing things (from RAM, the journal never
> >> > gets read unless there is a crash) to the filestore (OSD HDD) pretty much
> >> > immediately.
> >> >
> >> > Again, use Google to search the ML archives.
> >> >
> >> >> I was trying to understand the blocking, and how much my SAS OSDs
> >> >> affected my performance. I have a total of 9 hosts and 158 OSDs, each
> >> >> 1.8TB. The servers are connected through copper 10Gbit LACP bonds.
> >> >> My failure domain is by type rack; the CRUSH ruleset is by rack, with
> >> >> 3 hosts in each rack. Pool size is 3. I'm running Hammer on CentOS 7.
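A rack-level replicated rule like that usually looks roughly like this in
a decompiled CRUSH map (rule name and ruleset number here are made up):
---
rule replicated_rack {
        ruleset 1
        type replicated
        min_size 1
        max_size 10
        step take default
        step chooseleaf firstn 0 type rack
        step emit
}
---
i.e. one replica per rack, so with size=3 and 3 racks every replicated
write has to cross rack boundaries, which is where inter-rack latency
comes in.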
> >> >>
> >> >
> >> > Which raises the question of your full HW details (CPUs, RAM), network
> >> > (topology, what switches, inter-rack/switch links), etc.
> >> > The reason for this will become obvious below.
> >> >
> >> >> I did a simple fio test from one of my xl instances, and got the
> >> >> results below. The latency of 7.21ms is worrying; is this an expected
> >> >> result? Or is there any way I can further tune my cluster to achieve
> >> >> better results? thx will
> >> >>
> >> >
> >> >> FIO: sync=1, direct=1, bs=4k
> >> >>
> >> > Full command line, please.
> >> >
> >> > Small, sync I/Os are by far the hardest thing for Ceph.
> >> >
> >> > I can guess what some of the rest was, but it's better to know for sure.
> >> > Alternatively (or additionally), try this please:
> >> >
> >> > "fio --size=1G --ioengine=libaio --invalidate=1 --direct=1 --numjobs=1
> >> > --rw=randwrite --name=fiojob --blocksize=4K --iodepth=32"
> >> >
> >> >>
> >> >> write-50: (groupid=11, jobs=50): err= 0: pid=3945: Sun Oct 16 08:41:15 2016
> >> >>   write: io=832092KB, bw=27721KB/s, iops=6930, runt= 30017msec
> >> >>     clat (msec): min=2, max=253, avg= 7.21, stdev= 4.97
> >> >>      lat (msec): min=2, max=253, avg= 7.21, stdev= 4.97
> >> >
> >> > These numbers suggest you did randwrite and aren't all that surprising.
> >> > If you were to run atop on your OSD nodes while doing that fio run, you'd
> >> > likely see that both the CPUs and individual disks (HDDs) get very busy.
> >> >
> >> > There are several things conspiring against Ceph here: the latency of its
> >> > own code, the network latency of getting all the individual writes to each
> >> > replica, the fact that 1000 of these 4K blocks will hit one typical RBD
> >> > object (4MB) and thus one PG, making 3 OSDs very busy, etc.
> >> >
> >> > If you absolutely need low latencies with Ceph, consider dedicated
> >> > SSD-only pools for special-needs applications (DBs) or a cache tier if
> >> > it fits the profile and active working set.
> >> > Lower Ceph latency in general by having fast CPUs which have
> >> > powersaving (frequency throttling) disabled or set to "performance"
> >> > instead of "ondemand".
> >> >
> >> > Christian
> >> >
> >> >>     clat percentiles (msec):
> >> >>      |  1.00th=[    4],  5.00th=[    4], 10.00th=[    5], 20.00th=[    5],
> >> >>      | 30.00th=[    5], 40.00th=[    6], 50.00th=[    7], 60.00th=[    8],
> >> >>      | 70.00th=[    9], 80.00th=[   10], 90.00th=[   12], 95.00th=[   14],
> >> >>      | 99.00th=[   17], 99.50th=[   19], 99.90th=[   21], 99.95th=[   23],
> >> >>      | 99.99th=[  253]
> >> >>     bw (KB /s): min= 341, max= 870, per=2.01%, avg=556.60, stdev=136.98
> >> >>     lat (msec) : 4=8.24%, 10=74.10%, 20=17.52%, 50=0.12%, 500=0.02%
> >> >>   cpu          : usr=0.04%, sys=0.23%, ctx=425242, majf=0, minf=1570
> >> >>   IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
> >> >>      submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
> >> >>      complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
> >> >>      issued    : total=r=0/w=208023/d=0, short=r=0/w=0/d=0
> >> >>
> >> >> On Sun, Oct 16, 2016 at 4:18 PM, Christian Balzer <chibi@xxxxxxx> wrote:
> >> >> >
> >> >> > Hello,
> >> >> >
> >> >> > On Sun, 16 Oct 2016 15:03:24 +0800 William Josefsson wrote:
> >> >> >
> >> >> >> Hi list, while I know that writes in the RADOS backend are sync(), can
> >> >> >> anyone please explain when the cluster will return on a write call for
> >> >> >> RBD from VMs? Will data be considered synced once written to the
> >> >> >> journal or all the way to the OSD drive?
> >> >> >>
> >> >> > This has been answered countless times (really) here; the Ceph
> >> >> > Architecture documentation should really be more detailed about this,
> >> >> > as well as about how the data is sent to the secondary OSDs in parallel.
> >> >> >
> >> >> > It is of course ACK'ed to the client once all journals have successfully
> >> >> > written the data, otherwise journal SSDs would make a LOT less sense.
> >> >> >
> >> >> >> Each host in my cluster has 5x Intel S3610, and 18x1.8TB Hitachi 10krpm SAS.
> >> >> >>
> >> >> > The size of your SSDs (you didn't mention it) will determine the speed;
> >> >> > for journal purposes the sequential write speed is basically it.
> >> >> >
> >> >> > A 5:18 ratio implies that some of your SSDs hold more journals than others.
> >> >> >
> >> >> > You emphatically do NOT want that, because eventually the busier ones will
> >> >> > run out of endurance while the other ones still have plenty left.
> >> >> >
> >> >> > If possible, change this to a 5:20 or 6:18 ratio (depending on your SSDs
> >> >> > and expected write volume).
> >> >> >
> >> >> > Christian
> >> >> >
> >> >> >> I have size=3 for my pool. Will Ceph return once the data is written
> >> >> >> to at least 3 designated journals, or will it in fact wait until the
> >> >> >> data is written to the OSD drives? thx will
> >> >> >> _______________________________________________
> >> >> >> ceph-users mailing list
> >> >> >> ceph-users@xxxxxxxxxxxxxx
> >> >> >> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >> >> >>
> >> >> >
> >> >> >
> >> >> > --
> >> >> > Christian Balzer        Network/Systems Engineer
> >> >> > chibi@xxxxxxx   Global OnLine Japan/Rakuten Communications
> >> >> > http://www.gol.com/
> >> >
> >> >
> >> > --
> >> > Christian Balzer        Network/Systems Engineer
> >> > chibi@xxxxxxx   Global OnLine Japan/Rakuten Communications
> >> > http://www.gol.com/
> >> >
> >
> >
> > --
> > Christian Balzer        Network/Systems Engineer
> > chibi@xxxxxxx   Global OnLine Japan/Rakuten Communications
> > http://www.gol.com/
>


--
Christian Balzer        Network/Systems Engineer
chibi@xxxxxxx   Global OnLine Japan/Rakuten Communications
http://www.gol.com/
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com