> -----Original Message-----
> From: ceph-users [mailto:ceph-users-bounces@xxxxxxxxxxxxxx] On Behalf Of William Josefsson
> Sent: 17 October 2016 09:31
> To: Christian Balzer <chibi@xxxxxxx>
> Cc: ceph-users@xxxxxxxxxxxxxx
> Subject: Re: RBD with SSD journals and SAS OSDs
>
> Thx Christian for helping troubleshoot the latency issues. I have attached my fio job template below.
>
> To eliminate the possibility that the VM itself is the bottleneck, I've created a 128GB, 32 vCPU flavor. Here's the latest fio benchmark:
> http://pastebin.ca/raw/3729693
> I'm trying to benchmark the cluster's performance for SYNCED WRITEs and how well suited it would be for disk-intensive workloads or DBs.
>
> > The size (45GB) of these journals is only going to be used by a little
> > fraction, unlikely to be more than 1GB in normal operations and with
> > default filestore/journal parameters.
>
> To consume more of the SSDs in the hope of achieving lower latency, can you pls advise which parameters I should be looking at? I have
> already tried what's mentioned in RaySun's ceph blog, which eventually lowered my overall sync write IOPS by 1-2k.

Your biggest gains will probably come from forcing the CPUs to max frequency and forcing the C-state to 1:

intel_idle.max_cstate=0 on the kernel command line, and

echo 100 > /sys/devices/system/cpu/intel_pstate/min_perf_pct

(I think this is the same as the performance governor.)

Use something like powertop to check that all cores are running at max frequency and are staying in C-state 1.

I have managed to get the latency on my cluster down to about 600us, but with your hardware I suspect you won't be able to get it below ~1-1.5ms best case.
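For reference, this is roughly what I do on each OSD node. A sketch only: it assumes the intel_pstate driver (check scaling_driver first); with acpi-cpufreq you would set the governor via cpupower instead, and the grub paths below are the CentOS 7 BIOS-boot defaults.

# confirm which frequency scaling driver is active
cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_driver

# pin the minimum P-state to 100% of max; roughly the same effect as
# the 'performance' governor
echo 100 > /sys/devices/system/cpu/intel_pstate/min_perf_pct

# the C-state limit is a boot parameter: add intel_idle.max_cstate=0 to
# GRUB_CMDLINE_LINUX in /etc/default/grub, then regenerate and reboot
grub2-mkconfig -o /boot/grub2/grub.cfg

# verify afterwards: all cores should sit at max frequency and spend
# their idle time in C0/C1 (powertop's Idle Stats tab shows this)
grep MHz /proc/cpuinfo
powertop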
> # These are from RaySun's write-up, and they worsen my total IOPS.
> # http://xiaoquqi.github.io/blog/2015/06/28/ceph-performance-optimization-summary/
>
> filestore xattr use omap = true
> filestore min sync interval = 10
> filestore max sync interval = 15
> filestore queue max ops = 25000
> filestore queue max bytes = 10485760
> filestore queue committing max ops = 5000
> filestore queue committing max bytes = 10485760000
> journal max write bytes = 1073714824
> journal max write entries = 10000
> journal queue max ops = 50000
> journal queue max bytes = 10485760000
>
> My journals are Intel S3610 200GB, split into 4-5 partitions each. When I ran fio on the disks locally with direct=1 and sync=1, the
> WRITE performance was 50k IOPS at 7 threads.
>
> My hardware specs:
>
> - 3 controllers, the mons run here:
>   Dell PE R630, 64GB, Intel S3610 SSD
> - 9 storage nodes:
>   Dell R730xd, 2x E5-2630v4 2.2GHz, 512GB, Journal: 5x 200GB Intel S3610 SSD,
>   OSD: 18x 1.8TB Hitachi 10krpm SAS
>
> The RAID controller is a PERC H730.
>
> All servers have 2x10GbE bonds (Intel X540 copper, ixgbe) connecting to Arista 7050X 10Gbit switches with VARP and LACP interfaces.
> I have pinged all hosts from my VM and the RTT is 0.3ms on the LAN. I did iperf, and I can do 10Gbps from the VM to the storage nodes.
>
> I've already been tuning: the CPU scaling governor is set to 'performance' on all hosts for all cores. My Ceph release is the latest
> hammer on CentOS 7.
>
> The best write currently happens at 62 threads it seems; the IOPS is 8.3k for the direct synced writes. The latency and stddev are
> still concerning.. :(
>
> simple-write-62: (groupid=14, jobs=62): err= 0: pid=2748: Mon Oct 17 15:20:05 2016
>   write: io=978.64MB, bw=33397KB/s, iops=8349, runt= 30006msec
>     clat (msec): min=3, max=20, avg= 7.42, stdev= 2.50
>      lat (msec): min=3, max=20, avg= 7.42, stdev= 2.50
>     clat percentiles (usec):
>      |  1.00th=[ 3888],  5.00th=[ 4256], 10.00th=[ 4448], 20.00th=[ 4768],
>      | 30.00th=[ 5088], 40.00th=[ 5984], 50.00th=[ 7904], 60.00th=[ 8384],
>      | 70.00th=[ 8768], 80.00th=[ 9408], 90.00th=[10432], 95.00th=[11584],
>      | 99.00th=[13760], 99.50th=[14784], 99.90th=[16320], 99.95th=[16512],
>      | 99.99th=[17792]
>     bw (KB /s): min=  315, max=  761, per=1.61%, avg=537.06, stdev=77.13
>     lat (msec) : 4=1.99%, 10=84.54%, 20=13.47%, 50=0.01%
>   cpu          : usr=0.05%, sys=0.35%, ctx=509542, majf=0, minf=1902
>   IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
>      submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>      complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>      issued    : total=r=0/w=250527/d=0, short=r=0/w=0/d=0
>
> From the above we can tell that the latency for clients doing synced writes is somewhere around 5-10ms, which seems very high,
> especially with quite high-performing hardware, network, and SSD journals. I'm not sure whether it is the syncing from journal to OSD
> that causes these fluctuations or high latencies.
>
> Any help or advice would be much appreciated. thx will
>
>
> [global]
> bs=4k
> rw=write
> sync=1
> direct=1
> iodepth=1
> filename=${FILE}
> runtime=30
> stonewall=1
> group_reporting
>
> [simple-write-6]
> numjobs=6
> [simple-write-10]
> numjobs=10
> [simple-write-14]
> numjobs=14
> [simple-write-18]
> numjobs=18
> [simple-write-22]
> numjobs=22
> [simple-write-26]
> numjobs=26
> [simple-write-30]
> numjobs=30
> [simple-write-34]
> numjobs=34
> [simple-write-38]
> numjobs=38
> [simple-write-42]
> numjobs=42
> [simple-write-46]
> numjobs=46
> [simple-write-50]
> numjobs=50
> [simple-write-54]
> numjobs=54
> [simple-write-58]
> numjobs=58
> [simple-write-62]
> numjobs=62
> [simple-write-66]
> numjobs=66
> [simple-write-70]
> numjobs=70
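Side note on your local S3610 numbers: for anyone wanting to reproduce that test, the usual journal-suitability check is direct, synced sequential writes straight to the device. A sketch of what I assume was run (device path illustrative, and it is destructive to anything on the target):

# direct=1 bypasses the page cache, sync=1 makes every write O_SYNC,
# which is roughly the write pattern the filestore journal produces
fio --name=journal-test --filename=/dev/sdX --direct=1 --sync=1 \
    --rw=write --bs=4k --numjobs=7 --iodepth=1 --runtime=60 \
    --time_based --group_reporting

Drives without power-loss protection tend to collapse to a few hundred IOPS on this test even when their cached write numbers look great, which is why the S3610/S3700 class is popular for journals.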
> On Mon, Oct 17, 2016 at 10:47 AM, Christian Balzer <chibi@xxxxxxx> wrote:
> >
> > Hello,
> >
> > On Sun, 16 Oct 2016 19:07:17 +0800 William Josefsson wrote:
> >
> >> Ok thanks for sharing. Yes, my journals are Intel S3610 200GB, which I
> >> partition into 4 partitions of ~45GB each. When I ceph-deploy I declare
> >> these as the journals of the OSDs.
> >>
> > The size (45GB) of these journals is only going to be used by a little
> > fraction, unlikely to be more than 1GB in normal operations and with
> > default filestore/journal parameters.
> >
> > Because those defaults start flushing things (from RAM, the journal
> > never gets read unless there is a crash) to the filestore (OSD HDD)
> > pretty much immediately.
> >
> > Again, use Google to search the ML archives.
> >
> >> I was trying to understand the blocking, and how much my SAS OSDs
> >> affect my performance. I have a total of 9 hosts, 158 OSDs each of
> >> 1.8TB. The servers are connected through copper 10Gbit LACP bonds.
> >> My failure domain is by type RACK. The CRUSH rule set is by rack, 3
> >> hosts in each rack. Pool size is =3. I'm running hammer on CentOS 7.
> >>
> > Which begs the question to fully detail your HW (CPUs, RAM), network
> > (topology, what switches, inter-rack/switch links), etc.
> > The reason for this will become obvious below.
> >
> >> I did a simple fio test from one of my xl instances, and got the
> >> results below. The latency of 7.21ms is worrying; is this an expected
> >> result? Or is there any way I can further tune my cluster to achieve
> >> better results? thx will
> >>
> >> FIO: sync=1, direct=1, bs=4k
> >>
> > Full command line, please.
> >
> > Small, sync I/Os are by far the hardest thing for Ceph.
> >
> > I can guess what some of the rest was, but it's better to know for sure.
> > Alternatively, additionally, try this please:
> >
> > "fio --size=1G --ioengine=libaio --invalidate=1 --direct=1
> > --numjobs=1 --rw=randwrite --name=fiojob --blocksize=4K --iodepth=32"
> >
> >> write-50: (groupid=11, jobs=50): err= 0: pid=3945: Sun Oct 16 08:41:15 2016
> >>   write: io=832092KB, bw=27721KB/s, iops=6930, runt= 30017msec
> >>     clat (msec): min=2, max=253, avg= 7.21, stdev= 4.97
> >>      lat (msec): min=2, max=253, avg= 7.21, stdev= 4.97
> >
> > These numbers suggest you did randwrite, and they aren't all that surprising.
> > If you were to run atop on your OSD nodes while doing that fio run,
> > you'll likely see that both the CPUs and the individual disks (HDDs) get very busy.
> >
> > There are several things conspiring against Ceph here: the latency of
> > its own code, the network latency of getting all the individual
> > writes to each replica, the fact that 1000 of these 4K blocks will hit
> > one typical RBD object (4MB) and thus one PG, making 3 OSDs very busy, etc.
> >
> > If you absolutely need low latencies with Ceph, consider dedicated
> > SSD-only pools for special-needs applications (DBs), or a cache tier if it
> > fits the profile and active working set.
> > Lower Ceph latency in general by having fast CPUs which have
> > power-saving (frequency throttling) disabled or set to "performance"
> > instead of "ondemand".
> >
> > Christian
> >
> >>     clat percentiles (msec):
> >>      |  1.00th=[    4],  5.00th=[    4], 10.00th=[    5], 20.00th=[    5],
> >>      | 30.00th=[    5], 40.00th=[    6], 50.00th=[    7], 60.00th=[    8],
> >>      | 70.00th=[    9], 80.00th=[   10], 90.00th=[   12], 95.00th=[   14],
> >>      | 99.00th=[   17], 99.50th=[   19], 99.90th=[   21], 99.95th=[   23],
> >>      | 99.99th=[  253]
> >>     bw (KB /s): min=  341, max=  870, per=2.01%, avg=556.60, stdev=136.98
> >>     lat (msec) : 4=8.24%, 10=74.10%, 20=17.52%, 50=0.12%, 500=0.02%
> >>   cpu          : usr=0.04%, sys=0.23%, ctx=425242, majf=0, minf=1570
> >>   IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
> >>      submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
> >>      complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
> >>      issued    : total=r=0/w=208023/d=0, short=r=0/w=0/d=0
> >>
> >> On Sun, Oct 16, 2016 at 4:18 PM, Christian Balzer <chibi@xxxxxxx> wrote:
> >> >
> >> > Hello,
> >> >
> >> > On Sun, 16 Oct 2016 15:03:24 +0800 William Josefsson wrote:
> >> >
> >> >> Hi list, while I know that writes in the RADOS backend are sync()'d,
> >> >> can anyone please explain when the cluster will return on a write
> >> >> call for RBD from VMs? Will data be considered synced once written
> >> >> to the journal, or only when it is all the way on the OSD drive?
> >> >>
> >> > This has been answered countless times (really) here; the Ceph
> >> > Architecture documentation should really be more detailed about
> >> > this, as well as about how the data is sent to the secondary OSDs in parallel.
> >> >
> >> > It is of course ack'ed to the client once all journals have
> >> > successfully written the data, otherwise journal SSDs would make a LOT less sense.
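(To put rough numbers on the flushing defaults Christian mentions further up: if I remember the hammer-era values correctly -- worth verifying against your running config -- the filestore starts syncing journaled data out to the HDDs between 10ms and 5s after it lands in the journal:

filestore min sync interval = 0.01
filestore max sync interval = 5

which is why a 45GB journal partition rarely fills beyond a GB or so. The live values can be checked on an OSD node through the admin socket, e.g.:

ceph daemon osd.0 config show | grep sync_interval

osd.0 here is just an example ID; use any OSD local to that node.)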
> >> >
> >> >> Each host in my cluster has 5x Intel S3610 and 18x 1.8TB Hitachi 10krpm SAS.
> >> >>
> >> > The size of your SSDs (which you didn't mention) will determine the
> >> > speed; for journal purposes the sequential write speed is basically it.
> >> >
> >> > A 5:18 ratio implies that some of your SSDs hold more journals than others.
> >> >
> >> > You emphatically do NOT want that, because eventually the busier
> >> > ones will run out of endurance while the other ones still have plenty left.
> >> >
> >> > If possible, change this to a 5:20 or 6:18 ratio (depending on your
> >> > SSDs and expected write volume).
> >> >
> >> > Christian
> >> >
> >> >> I have size=3 for my pool. Will Ceph return once the data is
> >> >> written to at least the 3 designated journals, or will it in fact wait
> >> >> until the data is written to the OSD drives? thx will
> >> >
> >> > --
> >> > Christian Balzer        Network/Systems Engineer
> >> > chibi@xxxxxxx           Global OnLine Japan/Rakuten Communications
> >> > http://www.gol.com/
> >
> > --
> > Christian Balzer          Network/Systems Engineer
> > chibi@xxxxxxx             Global OnLine Japan/Rakuten Communications
> > http://www.gol.com/

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com