hi nick, I earlier ran cpupower frequency-set --governor performance on all my
hosts, which bumped all CPUs up to close to max speed. It didn't really help
much, and I still see 5-10ms latency in my fio benchmarks in VMs with the job
description below. Is there anything else I can do to push the SSDs harder? I
know direct synced writes may not be the most common application case, but I
need to improve this worst case. Benchmarking these SSDs locally with fio and
direct sync writes, they can do 40-50k IOPS. I'm not sure exactly what, but
something is holding back the max performance. I know the journals are sparsely
used from the collectd graphs. appreciate any advice. thx will

>> [global]
>> bs=4k
>> rw=write
>> sync=1
>> direct=1
>> iodepth=1
>> filename=/dev/vdb1
>> runtime=30
>> stonewall=1
>> group_reporting

grep "cpu MHz" /proc/cpuinfo
cpu MHz : 2945.250
cpu MHz : 2617.500
cpu MHz : 3065.062
cpu MHz : 2574.281
cpu MHz : 2739.468
cpu MHz : 2857.593
cpu MHz : 2602.125
cpu MHz : 2581.687
cpu MHz : 2958.656
cpu MHz : 2793.093
cpu MHz : 2682.750
cpu MHz : 2699.718
cpu MHz : 2620.125
cpu MHz : 2926.875
cpu MHz : 2740.031
cpu MHz : 2559.656
cpu MHz : 2758.875
cpu MHz : 2656.593
cpu MHz : 1476.187
cpu MHz : 2545.125
cpu MHz : 2792.718
cpu MHz : 2630.156
cpu MHz : 3090.750
cpu MHz : 2951.906
cpu MHz : 2845.875
cpu MHz : 2553.281
cpu MHz : 2602.125
cpu MHz : 2600.906
cpu MHz : 2737.031
cpu MHz : 2552.156
cpu MHz : 2624.625
cpu MHz : 2614.125

On Mon, Oct 17, 2016 at 5:17 PM, Nick Fisk <nick@xxxxxxxxxx> wrote:
>> -----Original Message-----
>> From: ceph-users [mailto:ceph-users-bounces@xxxxxxxxxxxxxx] On Behalf Of William Josefsson
>> Sent: 17 October 2016 09:31
>> To: Christian Balzer <chibi@xxxxxxx>
>> Cc: ceph-users@xxxxxxxxxxxxxx
>> Subject: Re: RBD with SSD journals and SAS OSDs
>>
>> Thx Christian for helping troubleshoot the latency issues. I have attached my fio job template below.
>>
>> To eliminate the VM as the bottleneck, I've created a 128GB, 32 vCPU flavor. Here's the latest fio benchmark:
>> http://pastebin.ca/raw/3729693 I'm trying to benchmark the cluster's performance for synced writes and how
>> well suited it would be for disk intensive workloads or DBs.
>>
>> > The size (45GB) of these journals is only going to be used by a little
>> > fraction, unlikely to be more than 1GB in normal operations and with
>> > default filestore/journal parameters.
>>
>> To consume more of the SSDs in the hope of achieving lower latency, can you pls advise what parameters I should
>> be looking at? I have already tried what's mentioned in RaySun's ceph blog, which eventually lowered my overall
>> sync write IOPS performance by 1-2k.
>
> Your biggest gains will probably be around forcing the CPUs to max frequency and forcing the c-state to 1.
>
> intel_idle.max_cstate=0 on the kernel parameters
> and
> echo 100 > /sys/devices/system/cpu/intel_pstate/min_perf_pct (I think this is the same as the performance governor)
>
> Use something like powertop to check that all cores are running at max freq and are staying in C-state 1.
>
> I have managed to get the latency on my cluster down to about 600us, but with your hardware I don't suspect you
> would be able to get it below ~1-1.5ms best case.
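A quick way to double-check that the governor and C-state settings actually took effect on every host (note that one
core in the /proc/cpuinfo output above is still reporting ~1476 MHz, so per-core verification is worthwhile). This is
only a sketch, assuming a Linux host with the usual sysfs cpufreq/cpuidle interfaces and turbostat (kernel-tools) or
powertop installed:

    # confirm the kernel parameter actually made it onto the boot command line
    cat /proc/cmdline

    # governor and current frequency per core
    grep . /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
    grep . /sys/devices/system/cpu/cpu*/cpufreq/scaling_cur_freq

    # which C-states the idle driver exposes, and how often they are entered
    grep . /sys/devices/system/cpu/cpu0/cpuidle/state*/name
    grep . /sys/devices/system/cpu/cpu0/cpuidle/state*/usage

    # live per-core frequency and C-state residency while fio is running
    turbostat        # or: powertop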
>> # These are from RaySun's write up, and worsen my total IOPS.
>> # http://xiaoquqi.github.io/blog/2015/06/28/ceph-performance-optimization-summary/
>>
>> filestore xattr use omap = true
>> filestore min sync interval = 10
>> filestore max sync interval = 15
>> filestore queue max ops = 25000
>> filestore queue max bytes = 10485760
>> filestore queue committing max ops = 5000
>> filestore queue committing max bytes = 10485760000
>> journal max write bytes = 1073714824
>> journal max write entries = 10000
>> journal queue max ops = 50000
>> journal queue max bytes = 10485760000
>>
>> My journals are Intel S3610 200GB, split into 4-5 partitions each. When I did fio on the disks locally with
>> direct=1 and sync=1, the write performance was 50k IOPS at 7 threads.
>>
>> My hardware specs:
>>
>> - 3 controllers (the mons run here):
>>   Dell PE R630, 64GB, Intel SSD S3610
>> - 9 storage nodes:
>>   Dell R730xd, 2x E5-2630 v4 2.2GHz, 512GB, Journal: 5x 200GB Intel S3610 SSD,
>>   OSD: 18x 1.8TB Hitachi 10krpm SAS
>>
>> The RAID controller is PERC 730.
>>
>> All servers have 2x10GbE bonds, Intel ixgbe X540 copper, connecting to Arista 7050X 10Gbit switches with VARP and
>> LACP interfaces. I have pinged all hosts from my VM and the RTT is 0.3ms on the LAN. I did iperf, and I can do
>> 10Gbps from the VM to the storage nodes.
>>
>> I've already been tuning: the CPU scaling governor is set to 'performance' on all hosts for all cores. My Ceph
>> release is the latest hammer on CentOS 7.
>>
>> The best write performance currently happens at 62 threads it seems; the IOPS is 8.3k for the direct synced
>> writes. The latency and stddev are still concerning.. :(
>>
>> simple-write-62: (groupid=14, jobs=62): err= 0: pid=2748: Mon Oct 17 15:20:05 2016
>>   write: io=978.64MB, bw=33397KB/s, iops=8349, runt= 30006msec
>>     clat (msec): min=3, max=20, avg= 7.42, stdev= 2.50
>>      lat (msec): min=3, max=20, avg= 7.42, stdev= 2.50
>>     clat percentiles (usec):
>>      |  1.00th=[ 3888],  5.00th=[ 4256], 10.00th=[ 4448], 20.00th=[ 4768],
>>      | 30.00th=[ 5088], 40.00th=[ 5984], 50.00th=[ 7904], 60.00th=[ 8384],
>>      | 70.00th=[ 8768], 80.00th=[ 9408], 90.00th=[10432], 95.00th=[11584],
>>      | 99.00th=[13760], 99.50th=[14784], 99.90th=[16320], 99.95th=[16512],
>>      | 99.99th=[17792]
>>     bw (KB /s): min= 315, max= 761, per=1.61%, avg=537.06, stdev=77.13
>>     lat (msec) : 4=1.99%, 10=84.54%, 20=13.47%, 50=0.01%
>>   cpu          : usr=0.05%, sys=0.35%, ctx=509542, majf=0, minf=1902
>>   IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
>>      submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>>      complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>>      issued    : total=r=0/w=250527/d=0, short=r=0/w=0/d=0
>>
>> From the above we can tell that the latency for clients doing synced writes is somewhere around 5-10ms, which
>> seems very high, especially with quite high-performing hardware, network, and SSD journals. I'm not sure whether
>> it may be the syncing from journal to OSD that causes these fluctuations or high latencies.
>>
>> Any help or advice would be much appreciated.
>> thx will
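As a cross-check on the numbers above: with iodepth=1 and sync=1 every job has exactly one write in flight, so the
aggregate IOPS is bounded by the number of jobs divided by the average commit latency:

    62 jobs x (1 / 7.42 ms) = 62 x ~135 writes/s ~= 8,350 IOPS

which matches the reported iops=8349 almost exactly. In other words, this workload is latency-bound per request rather
than throughput-bound, so it is the per-write commit path (CPU frequency and C-states, network round trips, journal
media) that sets the ceiling here, not queue or journal sizing.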
>>
>> [global]
>> bs=4k
>> rw=write
>> sync=1
>> direct=1
>> iodepth=1
>> filename=${FILE}
>> runtime=30
>> stonewall=1
>> group_reporting
>>
>> [simple-write-6]
>> numjobs=6
>> [simple-write-10]
>> numjobs=10
>> [simple-write-14]
>> numjobs=14
>> [simple-write-18]
>> numjobs=18
>> [simple-write-22]
>> numjobs=22
>> [simple-write-26]
>> numjobs=26
>> [simple-write-30]
>> numjobs=30
>> [simple-write-34]
>> numjobs=34
>> [simple-write-38]
>> numjobs=38
>> [simple-write-42]
>> numjobs=42
>> [simple-write-46]
>> numjobs=46
>> [simple-write-50]
>> numjobs=50
>> [simple-write-54]
>> numjobs=54
>> [simple-write-58]
>> numjobs=58
>> [simple-write-62]
>> numjobs=62
>> [simple-write-66]
>> numjobs=66
>> [simple-write-70]
>> numjobs=70
>>
>> On Mon, Oct 17, 2016 at 10:47 AM, Christian Balzer <chibi@xxxxxxx> wrote:
>> >
>> > Hello,
>> >
>> > On Sun, 16 Oct 2016 19:07:17 +0800 William Josefsson wrote:
>> >
>> >> Ok thanks for sharing. yes my journals are Intel S3610 200GB, which I
>> >> partition into 4 partitions of ~45GB each. When I ceph-deploy I declare
>> >> these as the journals of the OSDs.
>> >>
>> > The size (45GB) of these journals is only going to be used by a little
>> > fraction, unlikely to be more than 1GB in normal operations and with
>> > default filestore/journal parameters.
>> >
>> > Because those defaults start flushing things (from RAM, the journal
>> > never gets read unless there is a crash) to the filestore (OSD HDD)
>> > pretty much immediately.
>> >
>> > Again, use google to search the ML archives.
>> >
>> >> I was trying to understand the blocking, and how much my SAS OSDs
>> >> affect my performance. I have a total of 9 hosts and 158 OSDs, each
>> >> 1.8TB. The servers are connected through copper 10Gbit LACP bonds.
>> >> My failure domain is by type RACK. The CRUSH rule set is by rack, 3
>> >> hosts in each rack. Pool size is 3. I'm running hammer on CentOS 7.
>> >>
>> > Which begs the question to fully detail your HW (CPUs, RAM), network
>> > (topology, what switches, inter-rack/switch links), etc.
>> > The reason for this will become obvious below.
>> >
>> >> I did a simple fio test from one of my xl instances, and got the
>> >> results below. The latency of 7.21ms is worrying; is this an expected
>> >> result? Or is there any way I can further tune my cluster to achieve
>> >> better results? thx will
>> >>
>> >> FIO: sync=1, direct=1, bs=4k
>> >>
>> > Full command line, please.
>> >
>> > Small, sync I/Os are by far the hardest thing for Ceph.
>> >
>> > I can guess what some of the rest was, but it's better to know for sure.
>> > Alternatively, additionally, try this please:
>> >
>> > "fio --size=1G --ioengine=libaio --invalidate=1 --direct=1
>> > --numjobs=1 --rw=randwrite --name=fiojob --blocksize=4K --iodepth=32"
>> >
>> >> write-50: (groupid=11, jobs=50): err= 0: pid=3945: Sun Oct 16 08:41:15 2016
>> >>   write: io=832092KB, bw=27721KB/s, iops=6930, runt= 30017msec
>> >>     clat (msec): min=2, max=253, avg= 7.21, stdev= 4.97
>> >>      lat (msec): min=2, max=253, avg= 7.21, stdev= 4.97
>> >
>> > These numbers suggest you did randwrite and aren't all that surprising.
>> > If you were to run atop on your OSD nodes while doing that fio run,
>> > you'll likely see that both the CPUs and individual disks (HDDs) get very busy.
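For watching the OSD nodes during such a run, a minimal sketch (assuming the sysstat and atop packages are installed
on the storage nodes; the exact device names of the OSD HDDs will vary):

    # extended per-device statistics every 2 seconds: watch %util, await and queue size
    iostat -xm 2

    # or interactively; atop highlights saturated disks and busy ceph-osd processes
    atop 2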
>> >
>> > There are several things conspiring against Ceph here: the latency of
>> > its own code, the network latency of getting all the individual
>> > writes to each replica, and the fact that 1000 of these 4K blocks will hit
>> > one typical RBD object (4MB) and thus one PG, making 3 OSDs very busy, etc.
>> >
>> > If you absolutely need low latencies with Ceph, consider dedicated SSD-only
>> > pools for special-need applications (DB) or a cache tier if it
>> > fits the profile and active working set.
>> > Lower Ceph latency in general by having fast CPUs which have
>> > powersaving (frequency throttling) disabled or set to "performance"
>> > instead of "ondemand".
>> >
>> > Christian
>> >
>> >>     clat percentiles (msec):
>> >>      |  1.00th=[    4],  5.00th=[    4], 10.00th=[    5], 20.00th=[    5],
>> >>      | 30.00th=[    5], 40.00th=[    6], 50.00th=[    7], 60.00th=[    8],
>> >>      | 70.00th=[    9], 80.00th=[   10], 90.00th=[   12], 95.00th=[   14],
>> >>      | 99.00th=[   17], 99.50th=[   19], 99.90th=[   21], 99.95th=[   23],
>> >>      | 99.99th=[  253]
>> >>     bw (KB /s): min= 341, max= 870, per=2.01%, avg=556.60, stdev=136.98
>> >>     lat (msec) : 4=8.24%, 10=74.10%, 20=17.52%, 50=0.12%, 500=0.02%
>> >>   cpu          : usr=0.04%, sys=0.23%, ctx=425242, majf=0, minf=1570
>> >>   IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
>> >>      submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>> >>      complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>> >>      issued    : total=r=0/w=208023/d=0, short=r=0/w=0/d=0
>> >>
>> >> On Sun, Oct 16, 2016 at 4:18 PM, Christian Balzer <chibi@xxxxxxx> wrote:
>> >> >
>> >> > Hello,
>> >> >
>> >> > On Sun, 16 Oct 2016 15:03:24 +0800 William Josefsson wrote:
>> >> >
>> >> >> Hi list, while I know that writes in the RADOS backend are sync(),
>> >> >> can anyone please explain when the cluster will return on a write
>> >> >> call for RBD from VMs? Will data be considered synced once written
>> >> >> to the journal, or only once it is all the way on the OSD drive?
>> >> >>
>> >> > This has been answered countless times (really) here; the Ceph
>> >> > Architecture documentation should really be more detailed about
>> >> > this, as well as about how the data is sent to the secondary OSDs in parallel.
>> >> >
>> >> > It is of course ack'ed to the client once all journals have
>> >> > successfully written the data, otherwise journal SSDs would make a LOT less sense.
>> >> >
>> >> >> Each host in my cluster has 5x Intel S3610, and 18x 1.8TB Hitachi 10krpm SAS.
>> >> >>
>> >> > The size of your SSDs (you didn't mention it) will determine the
>> >> > speed; for journal purposes the sequential write speed is basically it.
>> >> >
>> >> > A 5:18 ratio implies that some of your SSDs hold more journals than others.
>> >> >
>> >> > You emphatically do NOT want that, because eventually the busier
>> >> > ones will run out of endurance while the other ones still have plenty left.
>> >> >
>> >> > If possible change this to a 5:20 or 6:18 ratio (depending on your
>> >> > SSDs and expected write volume).
>> >> >
>> >> > Christian
>> >> >
>> >> >> I have size=3 for my pool. Will Ceph return once the data is
>> >> >> written to at least 3 designated journals, or will it in fact wait
>> >> >> until the data is written to the OSD drives?
>> >> >> thx will
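On the journal point above, one common way to qualify an SSD as a journal device is to measure its O_DSYNC 4k write
behaviour directly, since that is essentially the journal's write pattern. A minimal sketch, assuming the journal SSD
(or a spare partition on it) is /dev/sdX and that it can be safely overwritten:

    # /dev/sdX below is a placeholder for the journal SSD; this write test destroys data on it
    fio --name=journal-test --filename=/dev/sdX --direct=1 --sync=1 \
        --rw=write --bs=4k --numjobs=1 --iodepth=1 --runtime=60 \
        --time_based --group_reporting

A journal SSD that cannot sustain these writes at well under 1 ms apiece will put a floor under client sync-write
latency no matter how the rest of the cluster is tuned.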
>> >> > --
>> >> > Christian Balzer           Network/Systems Engineer
>> >> > chibi@xxxxxxx              Global OnLine Japan/Rakuten Communications
>> >> > http://www.gol.com/
>
> _______________________________________________
> ceph-users mailing list
> ceph-users@xxxxxxxxxxxxxx
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com