Hello,

Note that the tests below were done on a VM with RBD cache disabled, so
the "direct=1" flag in FIO had a similar impact to "sync=1".
If your databases are MySQL, Oracle or something else that can use
O_DIRECT, RBD caching can improve things dramatically for you (with the
same risks that on-disk caches pose).

Compare this FIO run on the non-cached VM:
(fio --size=1G --ioengine=libaio --invalidate=1 --direct=1 --numjobs=1 --rw=write --name=fiojob --blocksize=4K --iodepth=1)
---
  write: io=1024.0MB, bw=2785.9KB/s, iops=696, runt=376395msec
    slat (usec): min=5, max=307, avg=15.58, stdev= 7.09
    clat (usec): min=930, max=26628, avg=1417.07, stdev=251.83
     lat (usec): min=944, max=26654, avg=1432.96, stdev=253.30
---

to the same one on a VM with default RBD caching:
---
  write: io=1024.0MB, bw=118269KB/s, iops=29567, runt= 8866msec
    slat (usec): min=2, max=241, avg= 3.93, stdev= 1.32
    clat (usec): min=0, max=24091, avg=29.29, stdev=59.01
     lat (usec): min=29, max=24095, avg=33.34, stdev=59.04
---

Christian

On Tue, 18 Oct 2016 23:55:31 +0800 William Josefsson wrote:

> Thx Christian for elaborating on this, appreciate it. I will rerun some
> of my benchmarks and take your advice into consideration. I have also
> found maximum performance recommendations for the Dell 730xd BIOS
> settings, hope these make sense: http://pasteboard.co/guHVMQVly.jpg
> I will set all these settings, and intel_idle.max_cstate=0 as
> suggested by Nick, and rerun the fio benchmarks. thx will
>
>
> On Tue, Oct 18, 2016 at 9:44 AM, Christian Balzer <chibi@xxxxxxx> wrote:
> >
> > Hello,
> >
> > As I had this written mostly already and since it covers some points Nick
> > raised in more detail, here we go.
> >
> > On Mon, 17 Oct 2016 16:30:48 +0800 William Josefsson wrote:
> >
> >> Thx Christian for helping troubleshoot the latency issues. I have
> >> attached my fio job template below.
> >>
> > There's no trouble here per se, just facts of life (Ceph).
> >
> > You'll be well advised to search the ML, especially for what Nick Fisk
> > had to write about these things (several times).
> >
> >> To eliminate the possibility that the VM is the bottleneck, I've
> >> created a 128GB, 32 vCPU flavor.
> > Nope, the client is not the issue.
> >
> >> Here's the latest fio benchmark:
> >> http://pastebin.ca/raw/3729693 I'm trying to benchmark the cluster's
> >> performance for SYNCED WRITEs and how well suited it would be for
> >> disk-intensive workloads or DBs.
> >>
> >
> > A single IOPS of that type and size will only hit the journal and be
> > ACK'ed quickly (well, quicker than what you see now), but FIO is creating
> > a constant stream of requests, eventually hitting the actual OSD as well.
> >
> > Aside from CPU load, of course.
> >
> >>
> >> > The size (45GB) of these journals is only going to be used by a little
> >> > fraction, unlikely to be more than 1GB in normal operations and with
> >> > default filestore/journal parameters.
> >>
> >> To consume more of the SSDs in the hope of achieving lower latency, can
> >> you pls advise what parameters I should be looking at?
> >
> > Not going to help with your prolonged FIO runs; once the flushing to the
> > OSDs commences, stalls will ensue.
> > The moment the journal is full or the timers kick in, things will go down
> > to OSD (HDD) speed.
> > The journal is there to help with small, short bursts.
> >
> >> I have already
> >> tried what's mentioned in RaySun's ceph blog, which eventually
> >> lowered my overall sync write IOPS performance by 1-2k.
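As an aside, before the settings below: if you want to confirm what a
running OSD actually uses for any of these parameters, the admin socket
is the easiest way; something along these lines, with osd.0 purely as an
example:
---
ceph daemon osd.0 config get filestore_min_sync_interval
ceph daemon osd.0 config get filestore_max_sync_interval
ceph daemon osd.0 config show | grep -E 'filestore|journal'
---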
> >>
> > Unsurprisingly, the default values are there for a reason.
> >
> >> # These are from RaySun's write-up, and worsen my total IOPS.
> >> # http://xiaoquqi.github.io/blog/2015/06/28/ceph-performance-optimization-summary/
> >>
> >> filestore xattr use omap = true
> >> filestore min sync interval = 10
> > Way too high, 0.5 is probably already excessive, I run with 0.1.
> >
> >> filestore max sync interval = 15
> >
> >> filestore queue max ops = 25000
> >> filestore queue max bytes = 10485760
> >> filestore queue committing max ops = 5000
> >> filestore queue committing max bytes = 10485760000
> > Your HDDs will choke on those 4. With a 10k SAS HDD a small increase over
> > the defaults may help.
> >
> >> journal max write bytes = 1073714824
> >> journal max write entries = 10000
> >> journal queue max ops = 50000
> >> journal queue max bytes = 10485760000
> >>
> >> My journals are Intel S3610 200GB, split into 4-5 partitions each.
> > Again, you want to even that out.
> >
> >> When
> >> I did FIO on the disks locally with direct=1 and sync=1, the WRITE
> >> performance was 50k IOPS for 7 threads.
> >>
> > Yes, but as I wrote, that's not how journals work; think more of 7
> > sequential writes, not rand-writes.
> >
> > And as I tried to explain before, the SSDs are not the bottleneck, your
> > CPUs may be and your OSD HDDs eventually will be.
> > Run atop on all your nodes when doing those tests and see how much things
> > get pushed (CPUs, disks, the OSD processes).
> >
> >> My hardware specs:
> >>
> >> - 3 Controllers, the mons run here
> >>   Dell PE R630, 64GB, Intel SSD S3610
> >> - 9 Storage nodes
> >>   Dell 730xd, 2x2630v4 2.2GHz, 512GB, Journal: 5x200GB Intel S3610 SSD,
> >>   OSD: 18x1.8TB Hitachi 10krpm SAS
> >>
> > I can't really fault you for the choice of CPU, but smaller nodes with
> > higher speed and fewer cores may help with this extreme test case (in
> > normal production you're fine).
> >
> >> RAID Controller is PERC 730
> >>
> >> All servers have 2x10GbE bonds, Intel ixgbe X540 copper connecting to
> >> Arista 7050X 10Gbit switches with VARP and LACP interfaces. I have
> >> pinged all hosts from my VM and the RTT is 0.3ms on the LAN. I did
> >> iperf, and I can do 10Gbps from the VM to the storage nodes.
> >>
> > Bandwidth is irrelevant in this case, but the RTT of 0.3ms feels a bit
> > high. If you look again at the flow in
> > http://docs.ceph.com/docs/hammer/architecture/#smart-daemons-enable-hyperscale
> >
> > those round trips will add up to a significant part of your Ceph latency.
> >
> > To elaborate and demonstrate:
> >
> > I have a test cluster consisting of 4 nodes, 2 of them HDD-backed OSDs with
> > SSD journals and 2 of them SSD-based (4x DC S3610 400GB each) acting as a
> > cache tier for the "normal" ones. All replication 2.
> > So for the purpose of this test, this is all 100% against the SSDs in the
> > cache pool only.
> >
> > The network is IPoIB (QDR, 40Gb/s InfiniBand) with 0.1ms latency between
> > nodes, CPU is a single E5-2620 v3.
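For comparison with the 0.1ms and 0.3ms figures, node-to-node RTT is easy
enough to sample with plain ping and a look at the avg/mdev it reports,
roughly:
---
ping -c 100 -i 0.2 <one-of-your-osd-nodes>
---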
> >
> > If I run this from a VM:
> > ---
> > fio --size=1G --ioengine=libaio --invalidate=1 --direct=1 --numjobs=1 --rw=randwrite --name=fiojob --blocksize=4K --iodepth=64
> > ---
> >
> > We wind up with:
> > ---
> >   write: io=1024.0MB, bw=34172KB/s, iops=8543, runt= 30685msec
> >     slat (usec): min=1, max=2874, avg= 4.66, stdev= 7.07
> >     clat (msec): min=1, max=66, avg= 7.49, stdev= 7.80
> >      lat (msec): min=1, max=66, avg= 7.49, stdev= 7.80
> > ---
> > During this run the CPU is the bottleneck, idle is around 60% (of 1200),
> > and all 4 OSD processes eat up nearly 3 CPU "cores".
> > As I said, small random IOPS are the most stressful thing for Ceph.
> > CPU performance settings influence this little/not at all, as everything
> > goes to full speed in less than a second and stays there.
> >
> >
> > If we change the FIO invocation to plain sequential "--rw=write", the CPU
> > usage is less than 250% (out of 1200) and things are pretty relaxed.
> > At that point we're basically pushing the edge of latency in all
> > components involved:
> > ---
> >   write: io=1024.0MB, bw=37819KB/s, iops=9454, runt= 27726msec
> >     slat (usec): min=1, max=3834, avg= 3.77, stdev= 8.42
> >     clat (usec): min=943, max=38129, avg=6764.11, stdev=3262.91
> >      lat (usec): min=954, max=38135, avg=6768.04, stdev=3263.55
> > ---
> >
> > If we then lower this to just one thread with "--iodepth=1", to see
> > how fast things could potentially be when we don't saturate everything:
> > ---
> >     slat (usec): min=12, max=100, avg=21.43, stdev= 7.96
> >     clat (usec): min=1725, max=5873, avg=2485.46, stdev=256.97
> >      lat (usec): min=1744, max=5894, avg=2507.35, stdev=257.11
> > ---
> >
> > So 2.5ms instead of 7ms. Not too shabby.
> >
> >
> > Now if we do the same run but with the CPU governors set to performance,
> > we get:
> > ---
> >     slat (usec): min=6, max=291, avg=17.34, stdev= 8.00
> >     clat (usec): min=957, max=13754, avg=1425.83, stdev=262.85
> >      lat (usec): min=968, max=13766, avg=1443.56, stdev=264.54
> > ---
> >
> > So that's where the CPU tuning comes in.
> > And this is, in real life where you hopefully don't have thousands of
> > small sync I/Os at the same time, a pretty decent result.
> >
> >
> >> I've already been tuning: CPU scaling governor set to 'performance' on
> >> all hosts for all cores. My Ceph release is the latest Hammer on CentOS 7.
> >>
> > Jewel is also supposed to have many improvements in this area, but frankly
> > I haven't been brave (convinced) enough to upgrade from Hammer yet.
> >
> > Christian
> >
> >> The best write currently happens at 62 threads it seems; the IOPS is
> >> 8.3k for the direct synced writes. The latency and stddev are still
> >> concerning.. :(
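On the governor point above, since it makes such a difference here: this
is just the cpufreq setting; a rough sketch, assuming the usual sysfs
interface is present:
---
# check the current governor on all cores
cat /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
# switch everything to performance (as root)
for g in /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor; do
    echo performance > $g
done
---
On CentOS 7, "cpupower frequency-set -g performance" should do the same.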
> >>
> >> simple-write-62: (groupid=14, jobs=62): err= 0: pid=2748: Mon Oct 17
> >> 15:20:05 2016
> >>   write: io=978.64MB, bw=33397KB/s, iops=8349, runt= 30006msec
> >>     clat (msec): min=3, max=20, avg= 7.42, stdev= 2.50
> >>      lat (msec): min=3, max=20, avg= 7.42, stdev= 2.50
> >>     clat percentiles (usec):
> >>      |  1.00th=[ 3888],  5.00th=[ 4256], 10.00th=[ 4448], 20.00th=[ 4768],
> >>      | 30.00th=[ 5088], 40.00th=[ 5984], 50.00th=[ 7904], 60.00th=[ 8384],
> >>      | 70.00th=[ 8768], 80.00th=[ 9408], 90.00th=[10432], 95.00th=[11584],
> >>      | 99.00th=[13760], 99.50th=[14784], 99.90th=[16320], 99.95th=[16512],
> >>      | 99.99th=[17792]
> >>     bw (KB /s): min= 315, max= 761, per=1.61%, avg=537.06, stdev=77.13
> >>     lat (msec) : 4=1.99%, 10=84.54%, 20=13.47%, 50=0.01%
> >>   cpu          : usr=0.05%, sys=0.35%, ctx=509542, majf=0, minf=1902
> >>   IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
> >>      submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
> >>      complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
> >>      issued    : total=r=0/w=250527/d=0, short=r=0/w=0/d=0
> >>
> >>
> >> From the above we can tell that the latency for clients doing synced
> >> writes is somewhere around 5-10ms, which seems very high, especially
> >> with quite high-performing hardware, network, and SSD journals. I'm not
> >> sure whether it may be the syncing from journal to OSD that causes
> >> these fluctuations or high latencies.
> >>
> >> Any help or advice would be much appreciated. thx will
> >>
> >>
> >> [global]
> >> bs=4k
> >> rw=write
> >> sync=1
> >> direct=1
> >> iodepth=1
> >> filename=${FILE}
> >> runtime=30
> >> stonewall=1
> >> group_reporting
> >>
> >> [simple-write-6]
> >> numjobs=6
> >> [simple-write-10]
> >> numjobs=10
> >> [simple-write-14]
> >> numjobs=14
> >> [simple-write-18]
> >> numjobs=18
> >> [simple-write-22]
> >> numjobs=22
> >> [simple-write-26]
> >> numjobs=26
> >> [simple-write-30]
> >> numjobs=30
> >> [simple-write-34]
> >> numjobs=34
> >> [simple-write-38]
> >> numjobs=38
> >> [simple-write-42]
> >> numjobs=42
> >> [simple-write-46]
> >> numjobs=46
> >> [simple-write-50]
> >> numjobs=50
> >> [simple-write-54]
> >> numjobs=54
> >> [simple-write-58]
> >> numjobs=58
> >> [simple-write-62]
> >> numjobs=62
> >> [simple-write-66]
> >> numjobs=66
> >> [simple-write-70]
> >> numjobs=70
> >>
> >> On Mon, Oct 17, 2016 at 10:47 AM, Christian Balzer <chibi@xxxxxxx> wrote:
> >> >
> >> > Hello,
> >> >
> >> >
> >> > On Sun, 16 Oct 2016 19:07:17 +0800 William Josefsson wrote:
> >> >
> >> >> Ok, thanks for sharing. Yes, my journals are Intel S3610 200GB, which
> >> >> I split into 4 partitions of ~45GB each. When I ceph-deploy I declare
> >> >> these as the journals of the OSDs.
> >> >>
> >> > The size (45GB) of these journals is only going to be used by a little
> >> > fraction, unlikely to be more than 1GB in normal operations and with
> >> > default filestore/journal parameters.
> >> >
> >> > Because those defaults start flushing things (from RAM, the journal never
> >> > gets read unless there is a crash) to the filestore (OSD HDD) pretty much
> >> > immediately.
> >> >
> >> > Again, use Google to search the ML archives.
> >> >
> >> >> I was trying to understand the blocking, and how much my SAS OSDs
> >> >> affected my performance. I have a total of 9 hosts and 158 OSDs, each
> >> >> 1.8TB. The servers are connected through copper 10Gbit LACP bonds.
> >> >> My failure domain is by type rack; the CRUSH ruleset is by rack, with
> >> >> 3 hosts in each rack. Pool size is 3. I'm running Hammer on CentOS 7.
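A rack-level replicated rule like that usually looks roughly like this in
a decompiled CRUSH map (rule name and ruleset number here are made up):
---
rule replicated_rack {
        ruleset 1
        type replicated
        min_size 1
        max_size 10
        step take default
        step chooseleaf firstn 0 type rack
        step emit
}
---
i.e. one replica per rack, so with size=3 and 3 racks every replicated
write has to cross rack boundaries, which is where inter-rack latency
comes in.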
> >> >>
> >> >
> >> > Which raises the question of your full HW details (CPUs, RAM), network
> >> > (topology, what switches, inter-rack/switch links), etc.
> >> > The reason for this will become obvious below.
> >> >
> >> >> I did a simple fio test from one of my xl instances, and got the
> >> >> results below. The latency of 7.21ms is worrying; is this an expected
> >> >> result? Or is there any way I can further tune my cluster to achieve
> >> >> better results? thx will
> >> >>
> >> >
> >> >> FIO: sync=1, direct=1, bs=4k
> >> >>
> >> > Full command line, please.
> >> >
> >> > Small, sync I/Os are by far the hardest thing for Ceph.
> >> >
> >> > I can guess what some of the rest was, but it's better to know for sure.
> >> > Alternatively (or additionally), try this please:
> >> >
> >> > "fio --size=1G --ioengine=libaio --invalidate=1 --direct=1 --numjobs=1
> >> > --rw=randwrite --name=fiojob --blocksize=4K --iodepth=32"
> >> >
> >> >>
> >> >> write-50: (groupid=11, jobs=50): err= 0: pid=3945: Sun Oct 16 08:41:15 2016
> >> >>   write: io=832092KB, bw=27721KB/s, iops=6930, runt= 30017msec
> >> >>     clat (msec): min=2, max=253, avg= 7.21, stdev= 4.97
> >> >>      lat (msec): min=2, max=253, avg= 7.21, stdev= 4.97
> >> >
> >> > These numbers suggest you did randwrite and aren't all that surprising.
> >> > If you were to run atop on your OSD nodes while doing that fio run, you'd
> >> > likely see that both the CPUs and individual disks (HDDs) get very busy.
> >> >
> >> > There are several things conspiring against Ceph here: the latency of its
> >> > own code, the network latency of getting all the individual writes to each
> >> > replica, the fact that 1000 of these 4K blocks will hit one typical RBD
> >> > object (4MB) and thus one PG, making 3 OSDs very busy, etc.
> >> >
> >> > If you absolutely need low latencies with Ceph, consider dedicated
> >> > SSD-only pools for special-needs applications (DBs) or a cache tier if
> >> > it fits the profile and active working set.
> >> > Lower Ceph latency in general by having fast CPUs which have
> >> > powersaving (frequency throttling) disabled or set to "performance"
> >> > instead of "ondemand".
> >> >
> >> > Christian
> >> >
> >> >>     clat percentiles (msec):
> >> >>      |  1.00th=[    4],  5.00th=[    4], 10.00th=[    5], 20.00th=[    5],
> >> >>      | 30.00th=[    5], 40.00th=[    6], 50.00th=[    7], 60.00th=[    8],
> >> >>      | 70.00th=[    9], 80.00th=[   10], 90.00th=[   12], 95.00th=[   14],
> >> >>      | 99.00th=[   17], 99.50th=[   19], 99.90th=[   21], 99.95th=[   23],
> >> >>      | 99.99th=[  253]
> >> >>     bw (KB /s): min= 341, max= 870, per=2.01%, avg=556.60, stdev=136.98
> >> >>     lat (msec) : 4=8.24%, 10=74.10%, 20=17.52%, 50=0.12%, 500=0.02%
> >> >>   cpu          : usr=0.04%, sys=0.23%, ctx=425242, majf=0, minf=1570
> >> >>   IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
> >> >>      submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
> >> >>      complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
> >> >>      issued    : total=r=0/w=208023/d=0, short=r=0/w=0/d=0
> >> >>
> >> >> On Sun, Oct 16, 2016 at 4:18 PM, Christian Balzer <chibi@xxxxxxx> wrote:
> >> >> >
> >> >> > Hello,
> >> >> >
> >> >> > On Sun, 16 Oct 2016 15:03:24 +0800 William Josefsson wrote:
> >> >> >
> >> >> >> Hi list, while I know that writes in the RADOS backend are sync(), can
> >> >> >> anyone please explain when the cluster will return on a write call for
> >> >> >> RBD from VMs? Will data be considered synced once written to the
> >> >> >> journal or all the way to the OSD drive?
> >> >> >>
> >> >> > This has been answered countless times (really) here; the Ceph
> >> >> > Architecture documentation should really be more detailed about this,
> >> >> > as well as about how the data is sent to the secondary OSDs in parallel.
> >> >> >
> >> >> > It is of course ACK'ed to the client once all journals have successfully
> >> >> > written the data, otherwise journal SSDs would make a LOT less sense.
> >> >> >
> >> >> >> Each host in my cluster has 5x Intel S3610, and 18x1.8TB Hitachi 10krpm SAS.
> >> >> >>
> >> >> > The size of your SSDs (you didn't mention it) will determine the speed;
> >> >> > for journal purposes the sequential write speed is basically it.
> >> >> >
> >> >> > A 5:18 ratio implies that some of your SSDs hold more journals than others.
> >> >> >
> >> >> > You emphatically do NOT want that, because eventually the busier ones will
> >> >> > run out of endurance while the other ones still have plenty left.
> >> >> >
> >> >> > If possible, change this to a 5:20 or 6:18 ratio (depending on your SSDs
> >> >> > and expected write volume).
> >> >> >
> >> >> > Christian
> >> >> >
> >> >> >> I have size=3 for my pool. Will Ceph return once the data is written
> >> >> >> to at least 3 designated journals, or will it in fact wait until the
> >> >> >> data is written to the OSD drives? thx will
> >> >> >> _______________________________________________
> >> >> >> ceph-users mailing list
> >> >> >> ceph-users@xxxxxxxxxxxxxx
> >> >> >> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >> >> >>
> >> >> >
> >> >> >
> >> >> > --
> >> >> > Christian Balzer        Network/Systems Engineer
> >> >> > chibi@xxxxxxx   Global OnLine Japan/Rakuten Communications
> >> >> > http://www.gol.com/
> >> >
> >> >
> >> > --
> >> > Christian Balzer        Network/Systems Engineer
> >> > chibi@xxxxxxx   Global OnLine Japan/Rakuten Communications
> >> > http://www.gol.com/
> >> >
> >
> >
> > --
> > Christian Balzer        Network/Systems Engineer
> > chibi@xxxxxxx   Global OnLine Japan/Rakuten Communications
> > http://www.gol.com/
>


--
Christian Balzer        Network/Systems Engineer
chibi@xxxxxxx   Global OnLine Japan/Rakuten Communications
http://www.gol.com/
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com