Re: RBD with SSD journals and SAS OSDs

Hello,

As I had most of this written already, and since it covers some of the points
Nick raised in more detail, here we go.

On Mon, 17 Oct 2016 16:30:48 +0800 William Josefsson wrote:

> Thx Christian for helping troubleshooting the latency issues. I have
> attached my fio job template below.
> 
There's no trouble here per se, just facts of life (Ceph).

You'd be well advised to search the ML, especially for what Nick Fisk has
written about these things (several times).

> I thought to eliminate the factor that the VM is the bottleneck, I've
> created a 128GB 32 vCPU flavor. 
Nope, the client is not the issue.

>Here's the latest fio benchmark.
> http://pastebin.ca/raw/3729693   I'm trying to benchmark the cluster's
> performance for SYNCED WRITEs and how well suited it would be for disk
> intensive workloads or DBs
>

A single IOPS of that type and size will only hit the journal and be
ACK'ed quickly (well, quicker than what you see now), but FIO is creating
a constant stream of requests, eventually hitting the actual OSDs as well.

Aside from CPU load, of course.
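
If you want to see the journal-only side of this, run a very short,
low-depth burst (illustrative only; adjust the filename to your setup); the
latency of that should be dominated by journal, code and network, not the
HDDs:
---
fio --name=burst --filename=/mnt/testfile --size=256M --bs=4k --rw=write \
    --sync=1 --direct=1 --iodepth=1 --numjobs=1 --runtime=5
---
Compare the latency of that against your 30-second runs.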

> 
> > The size (45GB) of these journals is only going to be used by a little
> > fraction, unlikely to be more than 1GB in normal operations and with
> > default filestore/journal parameters.
> 
> To consume more of the SSDs in the hope to achieve lower latency, can
> you pls advise what parameters I should be looking at? 

Not going to help with your prolonged FIO runs; once the flushing to the
OSDs commences, stalls will ensue.
The moment the journal is full or the timers kick in, things will go down
to OSD (HDD) speed. 
The journal is there to help with small, short bursts.
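
For reference, the "timers" are the filestore sync intervals; a minimal
sketch of the relevant ceph.conf bits (values are illustrative starting
points, not a recommendation for your hardware):
---
[osd]
# min/max seconds between syncs of journaled data down to the filestore;
# the max interval is the timer that forces flushing to the HDDs
filestore min sync interval = 0.1
filestore max sync interval = 5
---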

>I have already
> tried what's mentioned in RaySun's ceph blog, which eventually
> lowered my overall sync write IOPs performance by 1-2k.
>
Unsurprisingly, the default values are there for a reason.
 
> # These are from RaySun's  write up, and worsen my total IOPs.
> # http://xiaoquqi.github.io/blog/2015/06/28/ceph-performance-optimization-summary/
> 
> filestore xattr use omap = true
> filestore min sync interval = 10
Way too high; 0.5 is probably already excessive, I run with 0.1.

> filestore max sync interval = 15

> filestore queue max ops = 25000
> filestore queue max bytes = 10485760
> filestore queue committing max ops = 5000
> filestore queue committing max bytes = 10485760000
Your HDDs will choke on those four. With 10k SAS HDDs a small increase over
the defaults may help.
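If memory serves, the Hammer defaults for these are on the order of 50 ops
and 100 MB for the filestore queue (verify with "ceph daemon osd.0 config
show | grep filestore_queue"); a modest bump as a sketch, not a tested
recommendation:
---
# illustrative only, roughly 2-4x the defaults
filestore queue max ops = 200
filestore queue max bytes = 209715200
---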

> journal max write bytes = 1073714824
> journal max write entries = 10000
> journal queue max ops = 50000
> journal queue max bytes = 10485760000
>
> My Journals are Intel s3610 200GB, split in 4-5 partitions each. 
Again, you want to even that out.

>When
> I did FIO on the disks locally with direct=1 and sync=1 the WRITE
> performance was 50k iops for 7 threads.
>
Yes, but as I wrote, that's not how journals work; think of 7 sequential
write streams, not random writes. 
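
The usual way to gauge an SSD for journal duty is therefore sequential sync
writes against the device, something in this vein (a sketch only; /dev/sdX
is a placeholder and this will overwrite whatever is on it):
---
fio --name=journal-test --filename=/dev/sdX --direct=1 --sync=1 \
    --rw=write --bs=4k --numjobs=5 --iodepth=1 --runtime=60 --group_reporting
---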

And as I tried to explain before, the SSDs are not the bottleneck; your
CPUs may be, and your OSD HDDs eventually will be. 
Run atop on all your nodes when doing those tests and see how much things
get pushed (CPUs, disks, the OSD processes).
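
Something along these lines on each storage node while the fio run is going
(nothing fancy, just the usual suspects):
---
atop 2          # per-disk busy% and per-process CPU of the ceph-osd daemons
ceph osd perf   # commit/apply latency per OSD
---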

> My hardware specs:
> 
> - 3 Controllers, The mons run here
> Dell PE R630, 64GB, Intel SSD s3610
> - 9 Storage nodes
> Dell 730xd, 2x2630v4 2.2Ghz, 512GB, Journal: 5x200GB Intel 3610 SSD,
> OSD: 18x1.8TB Hitachi 10krpm SAS
> 
I can't really fault you for the choice of CPU, but smaller nodes with
fewer, higher-clocked cores may help with this extreme test case (in normal
production you're fine).

> RAID Controller is PERC 730
> 
> All servers have 2x10GbE bonds, Intel ixgbe X540 copper connecting to
> Arista 7050X 10Gbit Switches with VARP, and LACP interfaces. I have
> from my VM pinged all hosts and the RTT is 0.3ms on the LAN. I did
> iperf, and I can do 10Gbps from the VM to the storage nodes.
> 
Bandwidth is irrelevant in this case; the RTT of 0.3ms feels a bit high.
If you look again at the flow in
http://docs.ceph.com/docs/hammer/architecture/#smart-daemons-enable-hyperscale

those round trips will add up to a significant part of your Ceph latency.
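
Back of the envelope, with size=3 and a primary that replicates to the
other two in parallel:
---
client -> primary OSD      ~0.3 ms round trip
primary -> 2 replicas      ~0.3 ms round trip (in parallel)
------------------------------------------------------------
network alone              very roughly 0.6 ms or more per sync write
---
and that's before Ceph's own code and the journal write itself enter the
picture.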

To elaborate and demonstrate:

I have a test cluster consisting of 4 nodes, 2 of them with HDD-backed OSDs
and SSD journals and 2 of them SSD-based (4x DC S3610 400GB each) acting as
a cache-tier for the "normal" ones. All replication 2.
So for the purpose of this test, this is all 100% against the SSDs in the
cache-pool only.

The network is IPoIB (QDDR, 40Gb/s Infiniband) with 0.1ms latency between
nodes, CPU is a single E5-2620 v3.

If I run this from a VM:
---
fio --size=1G --ioengine=libaio --invalidate=1 --direct=1 --numjobs=1 --rw=randwrite --name=fiojob --blocksize=4K --iodepth=64
---

We wind up with:
---
  write: io=1024.0MB, bw=34172KB/s, iops=8543, runt= 30685msec
    slat (usec): min=1, max=2874, avg= 4.66, stdev= 7.07
    clat (msec): min=1, max=66, avg= 7.49, stdev= 7.80
     lat (msec): min=1, max=66, avg= 7.49, stdev= 7.80
---
During this run the CPU is the bottleneck: idle is around 60% (of 1200) and
all 4 OSD processes eat up nearly 3 CPU "cores".
As I said, small random I/O is the most stressful thing for Ceph.
CPU frequency/governor settings influence this little or not at all here, as
everything goes to full speed in less than a second and stays there.


If we change the FIO invocation to plain sequential "--rw=write", CPU usage
is less than 250% (out of 1200) and things are pretty relaxed.
At that point we're basically pushing against the latency limits of all the
components involved:
---
  write: io=1024.0MB, bw=37819KB/s, iops=9454, runt= 27726msec
    slat (usec): min=1, max=3834, avg= 3.77, stdev= 8.42
    clat (usec): min=943, max=38129, avg=6764.11, stdev=3262.91
     lat (usec): min=954, max=38135, avg=6768.04, stdev=3263.55
---

If we then lower this to a single outstanding I/O with "--iodepth=1", to see
how fast things could potentially be when we don't saturate everything:
---
    slat (usec): min=12, max=100, avg=21.43, stdev= 7.96
    clat (usec): min=1725, max=5873, avg=2485.46, stdev=256.97
     lat (usec): min=1744, max=5894, avg=2507.35, stdev=257.11
---

So 2.5ms instead of 7ms. Not too shabby.


Now if we do the same run but with CPU governors set to performance we get:
---
    slat (usec): min=6, max=291, avg=17.34, stdev= 8.00
    clat (usec): min=957, max=13754, avg=1425.83, stdev=262.85
     lat (usec): min=968, max=13766, avg=1443.56, stdev=264.54
---

So that's where the CPU tuning comes in.
And this is, in real life where you hopefully don't have thousands of
small sync I/Os at the same time, a pretty decent result.
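
For completeness, setting the governor is typically just (exact tooling
varies by distro, this is a sketch; cpupower comes from kernel-tools on
CentOS 7):
---
cpupower frequency-set -g performance
# or directly via sysfs on every core:
for c in /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor; do
    echo performance > $c
done
---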


> I've already been tuning, CPU scaling governor to 'performance' on all
> hosts for all cores. My CEPH release is latest hammer on CentOS7.
> 
Jewel is also supposed to have many improvements in this area, but frankly
I haven't been brave (convinced) enough to upgrade from Hammer yet.

Christian

> The best write currently happens at 62 threads it seems, the IOPS is
> 8.3k for the direct synced writes. The latency and stddev are still
> concerning.. :(
> 
> simple-write-62: (groupid=14, jobs=62): err= 0: pid=2748: Mon Oct 17
> 15:20:05 2016
>   write: io=978.64MB, bw=33397KB/s, iops=8349, runt= 30006msec
>     clat (msec): min=3, max=20, avg= 7.42, stdev= 2.50
>      lat (msec): min=3, max=20, avg= 7.42, stdev= 2.50
>     clat percentiles (usec):
>      |  1.00th=[ 3888],  5.00th=[ 4256], 10.00th=[ 4448], 20.00th=[ 4768],
>      | 30.00th=[ 5088], 40.00th=[ 5984], 50.00th=[ 7904], 60.00th=[ 8384],
>      | 70.00th=[ 8768], 80.00th=[ 9408], 90.00th=[10432], 95.00th=[11584],
>      | 99.00th=[13760], 99.50th=[14784], 99.90th=[16320], 99.95th=[16512],
>      | 99.99th=[17792]
>     bw (KB  /s): min=  315, max=  761, per=1.61%, avg=537.06, stdev=77.13
>     lat (msec) : 4=1.99%, 10=84.54%, 20=13.47%, 50=0.01%
>   cpu          : usr=0.05%, sys=0.35%, ctx=509542, majf=0, minf=1902
>   IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
>      submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>      complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>      issued    : total=r=0/w=250527/d=0, short=r=0/w=0/d=0
> 
> 
> From the above we can tell that the latency for clients doing synced
> writes, is somewhere 5-10ms which seems very high, especially with
> quite high performing hardware, network, and SSD journals. I'm not
> sure whether it may be the syncing from Journal to OSD that causes
> these fluctuations or high latencies.
> 
> Any help or advice would be much appreciated. thx will
> 
> 
> [global]
> bs=4k
> rw=write
> sync=1
> direct=1
> iodepth=1
> filename=${FILE}
> runtime=30
> stonewall=1
> group_reporting
> 
> [simple-write-6]
> numjobs=6
> [simple-write-10]
> numjobs=10
> [simple-write-14]
> numjobs=14
> [simple-write-18]
> numjobs=18
> [simple-write-22]
> numjobs=22
> [simple-write-26]
> numjobs=26
> [simple-write-30]
> numjobs=30
> [simple-write-34]
> numjobs=34
> [simple-write-38]
> numjobs=38
> [simple-write-42]
> numjobs=42
> [simple-write-46]
> numjobs=46
> [simple-write-50]
> numjobs=50
> [simple-write-54]
> numjobs=54
> [simple-write-58]
> numjobs=58
> [simple-write-62]
> numjobs=62
> [simple-write-66]
> numjobs=66
> [simple-write-70]
> numjobs=70
> 
> On Mon, Oct 17, 2016 at 10:47 AM, Christian Balzer <chibi@xxxxxxx> wrote:
> >
> > Hello,
> >
> >
> > On Sun, 16 Oct 2016 19:07:17 +0800 William Josefsson wrote:
> >
> >> Ok thanks for sharing. yes my journals are Intel S3610 200GB, which I
> >> partition in 4 partitions each ~45GB. When I ceph-deploy I declare
> >> these as the journals of the OSDs.
> >>
> > The size (45GB) of these journals is only going to be used by a little
> > fraction, unlikely to be more than 1GB in normal operations and with
> > default filestore/journal parameters.
> >
> > Because those defaults start flushing things (from RAM, the journal never
> > gets read unless there is a crash) to the filestore (OSD HDD) pretty much
> > immediately.
> >
> > Again, use google to search the ML archives.
> >
> >> I was trying to understand the blocking, and how much my SAS OSDs
> >> affected my performance. I have a total of 9 hosts, 158 OSDs each
> >> 1.8TB. The Servers are connected through copper 10Gbit LACP bonds.
> >> My failure domain is by type RACK. The CRUSH rule set is by rack. 3
> >> hosts in each rack. Pool size is =3. I'm running hammer on centos7.
> >>
> >
> > Which begs the question to fully detail your HW (CPUs, RAM), network
> > (topology, what switches, inter-rack/switch links), etc.
> > The reason for this will become obvious below.
> >
> >> I did a simple fio test from one of my xl instances, and got the
> >> results below. The Latency 7.21ms is worrying, is this expected
> >> results? Or is there any way I can further tune my cluster to achieve
> >> better results? thx will
> >>
> >
> >> FIO: sync=1, direct=1, bs=4k
> >>
> > Full command line, please.
> >
> > Small, sync I/Os are by far the hardest thing for Ceph.
> >
> > I can guess what some of the rest was, but it's better to know for sure.
> > Alternatively, additionally, try this please:
> >
> > "fio --size=1G --ioengine=libaio --invalidate=1  --direct=1 --numjobs=1
> > --rw=randwrite --name=fiojob --blocksize=4K --iodepth=32"
> >
> >>
> >> write-50: (groupid=11, jobs=50): err= 0: pid=3945: Sun Oct 16 08:41:15 2016
> >>   write: io=832092KB, bw=27721KB/s, iops=6930, runt= 30017msec
> >>     clat (msec): min=2, max=253, avg= 7.21, stdev= 4.97
> >>      lat (msec): min=2, max=253, avg= 7.21, stdev= 4.97
> >
> > These numbers suggest you did randwrite and aren't all that surprising.
> > If you were to run atop on your OSD nodes while doing that fio run, you'll
> > likely see that both CPUs and individual disk (HDDs) get very busy.
> >
> > There are several things conspiring against Ceph here, the latency of its
> > own code, the network latency of getting all the individual writes to each
> > replica, the fact that 1000 of these 4K blocks will hit one typical RBD
> > object (4MB) and thus one PG, make 3 OSDs very busy, etc.
> >
> > If you absolutely need low latencies with Ceph, consider dedicated SSD
> > only pools for special need applications (DB) or a cache tier if it fits
> > the profile and active working set.
> > Lower Ceph latency in general by having fast CPUs which have
> > powersaving (frequency throttling) disabled or set to "performance"
> > instead of "ondemand".
> >
> > Christian
> >
> >>     clat percentiles (msec):
> >>      |  1.00th=[    4],  5.00th=[    4], 10.00th=[    5], 20.00th=[    5],
> >>      | 30.00th=[    5], 40.00th=[    6], 50.00th=[    7], 60.00th=[    8],
> >>      | 70.00th=[    9], 80.00th=[   10], 90.00th=[   12], 95.00th=[   14],
> >>      | 99.00th=[   17], 99.50th=[   19], 99.90th=[   21], 99.95th=[   23],
> >>      | 99.99th=[  253]
> >>     bw (KB  /s): min=  341, max=  870, per=2.01%, avg=556.60, stdev=136.98
> >>     lat (msec) : 4=8.24%, 10=74.10%, 20=17.52%, 50=0.12%, 500=0.02%
> >>   cpu          : usr=0.04%, sys=0.23%, ctx=425242, majf=0, minf=1570
> >>   IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
> >>      submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
> >>      complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
> >>      issued    : total=r=0/w=208023/d=0, short=r=0/w=0/d=0
> >>
> >> On Sun, Oct 16, 2016 at 4:18 PM, Christian Balzer <chibi@xxxxxxx> wrote:
> >> >
> >> > Hello,
> >> >
> >> > On Sun, 16 Oct 2016 15:03:24 +0800 William Josefsson wrote:
> >> >
> >> >> Hi list, while I know that writes in the RADOS backend are sync() can
> >> >> anyone please explain when the cluster will return on a write call for
> >> >> RBD from VMs? Will data be considered synced one written to the
> >> >> journal or all the way to the OSD drive?
> >> >>
> >> > This has been answered countless times (really) here, the Ceph Architecture
> >> > documentation should really be more detailed about this, as well as how
> >> > parallel the data is being sent to the secondary OSDs.
> >> >
> >> > It is of course ack'ed to the client once all journals have successfully
> >> > written the data, otherwise journal SSDs would make a LOT less sense.
> >> >
> >> >> Each host in my cluster has 5x Intel S3610, and 18x1.8TB Hitachi 10krpm SAS.
> >> >>
> >> > The size of your SSDs (you didn't mention) will determine the speed, for
> >> > journal purposes the sequential write speed is basically it.
> >> >
> >> > A 5:18 ratio implies that some of your SSDs hold more journals than others.
> >> >
> >> > You emphatically do NOT want that, because eventually the busier ones will
> >> > run out of endurance while the other ones still have plenty left.
> >> >
> >> > If possible change this to a 5:20 or 6:18 ratio (depending on your SSDs
> >> > and expected write volume).
> >> >
> >> > Christian
> >> >> I have size=3 for my pool. Will Ceph return once the data is written
> >> >> to at least 3 designated journals, or will it in fact wait until the
> >> >> data is written to the OSD drives? thx will
> >> >> _______________________________________________
> >> >> ceph-users mailing list
> >> >> ceph-users@xxxxxxxxxxxxxx
> >> >> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >> >>
> >> >
> >> >
> >> > --
> >> > Christian Balzer        Network/Systems Engineer
> >> > chibi@xxxxxxx           Global OnLine Japan/Rakuten Communications
> >> > http://www.gol.com/
> >>
> >
> >
> > --
> > Christian Balzer        Network/Systems Engineer
> > chibi@xxxxxxx           Global OnLine Japan/Rakuten Communications
> > http://www.gol.com/
> 


-- 
Christian Balzer        Network/Systems Engineer                
chibi@xxxxxxx   	Global OnLine Japan/Rakuten Communications
http://www.gol.com/
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


