Thanks Christian for helping troubleshoot the latency issues. I have attached my fio job template below. To rule out the VM as the bottleneck, I've created a 128GB, 32-vCPU flavor. Here's the latest fio benchmark: http://pastebin.ca/raw/3729693

I'm trying to benchmark the cluster's performance for SYNCED writes and how well suited it would be for disk-intensive workloads or DBs.

> The size (45GB) of these journals is only going to be used by a little
> fraction, unlikely to be more than 1GB in normal operations and with
> default filestore/journal parameters.

To consume more of the SSDs in the hope of achieving lower latency, can you please advise which parameters I should be looking at? I have already tried what's mentioned in RaySun's ceph blog, which actually lowered my overall sync write IOPS by 1-2k.

# These are from RaySun's write-up, and worsen my total IOPS.
# http://xiaoquqi.github.io/blog/2015/06/28/ceph-performance-optimization-summary/
filestore xattr use omap = true
filestore min sync interval = 10
filestore max sync interval = 15
filestore queue max ops = 25000
filestore queue max bytes = 10485760
filestore queue committing max ops = 5000
filestore queue committing max bytes = 10485760000
journal max write bytes = 1073714824
journal max write entries = 10000
journal queue max ops = 50000
journal queue max bytes = 10485760000

My journals are Intel S3610 200GB, split into 4-5 partitions each. When I ran fio on the disks locally with direct=1 and sync=1, the write performance was 50k IOPS at 7 threads.

My hardware specs:
- 3 controllers (the mons run here): Dell PE R630, 64GB, Intel SSD S3610
- 9 storage nodes: Dell R730xd, 2x E5-2630v4 2.2GHz, 512GB,
  Journal: 5x 200GB Intel S3610 SSD, OSD: 18x 1.8TB Hitachi 10krpm SAS
- RAID controller is PERC H730

All servers have 2x10GbE bonds (Intel ixgbe X540 copper) connecting to Arista 7050X 10Gbit switches with VARP and LACP interfaces.
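For completeness, this is the much smaller change I was planning to test next, instead of RaySun's full set. The values here are placeholders of my own, not recommendations; the idea is just to let the journal absorb more writes before flushing (the hammer defaults for the sync intervals are 0.01/5 seconds, as far as I can tell):

```
[osd]
# Placeholder values to test: let the filestore wait longer before
# flushing journaled writes to the OSD HDDs (defaults ~0.01 / 5 sec)
filestore min sync interval = 0.5
filestore max sync interval = 10
# Modest journal queue bump instead of RaySun's very large values
journal max write entries = 1000
journal queue max ops = 3000
```

If these sync intervals are the wrong knobs for latency (as opposed to throughput), please say so and I'll drop them.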
I have pinged all hosts from my VM and the RTT is 0.3ms on the LAN. I ran iperf, and I can do 10Gbps from the VM to the storage nodes. I've already been tuning: the CPU scaling governor is set to 'performance' on all hosts for all cores. My Ceph release is the latest hammer on CentOS 7.

The best write currently happens at 62 threads it seems; the IOPS is 8.3k for the direct synced writes. The latency and stddev are still concerning.. :(

simple-write-62: (groupid=14, jobs=62): err= 0: pid=2748: Mon Oct 17 15:20:05 2016
  write: io=978.64MB, bw=33397KB/s, iops=8349, runt= 30006msec
    clat (msec): min=3, max=20, avg= 7.42, stdev= 2.50
     lat (msec): min=3, max=20, avg= 7.42, stdev= 2.50
    clat percentiles (usec):
     |  1.00th=[ 3888],  5.00th=[ 4256], 10.00th=[ 4448], 20.00th=[ 4768],
     | 30.00th=[ 5088], 40.00th=[ 5984], 50.00th=[ 7904], 60.00th=[ 8384],
     | 70.00th=[ 8768], 80.00th=[ 9408], 90.00th=[10432], 95.00th=[11584],
     | 99.00th=[13760], 99.50th=[14784], 99.90th=[16320], 99.95th=[16512],
     | 99.99th=[17792]
    bw (KB /s): min=  315, max=  761, per=1.61%, avg=537.06, stdev=77.13
    lat (msec) : 4=1.99%, 10=84.54%, 20=13.47%, 50=0.01%
  cpu          : usr=0.05%, sys=0.35%, ctx=509542, majf=0, minf=1902
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued    : total=r=0/w=250527/d=0, short=r=0/w=0/d=0

From the above we can tell that the latency for clients doing synced writes is somewhere around 5-10ms, which seems very high, especially with quite high-performing hardware, network, and SSD journals. I'm not sure whether it may be the syncing from journal to OSD that causes these fluctuations or high latencies. Any help or advice would be much appreciated.
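As a back-of-the-envelope check (my own arithmetic, nothing authoritative): with sync=1 and iodepth=1 every job has exactly one write in flight, so total IOPS is bounded by numjobs divided by the average latency, and that matches the fio output almost exactly, i.e. the cluster is purely latency-bound:

```python
# With sync=1 and iodepth=1, each fio job completes one write per
# round trip, so throughput is simply numjobs / average latency.
numjobs = 62
avg_lat_s = 0.00742   # 7.42 ms average completion latency from fio above

expected_iops = numjobs / avg_lat_s
print(round(expected_iops))  # ~8356, vs. the 8349 that fio reported
```

So more threads is the only way I'm getting more IOPS out of this, and it does nothing for the per-write latency or the stddev.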
thx will

[global]
bs=4k
rw=write
sync=1
direct=1
iodepth=1
filename=${FILE}
runtime=30
stonewall=1
group_reporting

[simple-write-6]
numjobs=6
[simple-write-10]
numjobs=10
[simple-write-14]
numjobs=14
[simple-write-18]
numjobs=18
[simple-write-22]
numjobs=22
[simple-write-26]
numjobs=26
[simple-write-30]
numjobs=30
[simple-write-34]
numjobs=34
[simple-write-38]
numjobs=38
[simple-write-42]
numjobs=42
[simple-write-46]
numjobs=46
[simple-write-50]
numjobs=50
[simple-write-54]
numjobs=54
[simple-write-58]
numjobs=58
[simple-write-62]
numjobs=62
[simple-write-66]
numjobs=66
[simple-write-70]
numjobs=70

On Mon, Oct 17, 2016 at 10:47 AM, Christian Balzer <chibi@xxxxxxx> wrote:
>
> Hello,
>
> On Sun, 16 Oct 2016 19:07:17 +0800 William Josefsson wrote:
>
>> Ok thanks for sharing. Yes, my journals are Intel S3610 200GB, which I
>> partition in 4 partitions, each ~45GB. When I ceph-deploy I declare
>> these as the journals of the OSDs.
>>
> The size (45GB) of these journals is only going to be used by a little
> fraction, unlikely to be more than 1GB in normal operations and with
> default filestore/journal parameters.
>
> Because those defaults start flushing things (from RAM, the journal never
> gets read unless there is a crash) to the filestore (OSD HDD) pretty much
> immediately.
>
> Again, use google to search the ML archives.
>
>> I was trying to understand the blocking, and how much my SAS OSDs
>> affected my performance. I have a total of 9 hosts, 158 OSDs each
>> 1.8TB. The servers are connected through copper 10Gbit LACP bonds.
>> My failure domain is by type RACK. The CRUSH rule set is by rack, 3
>> hosts in each rack. Pool size is =3. I'm running hammer on centos7.
>>
>
> Which begs the question to fully detail your HW (CPUs, RAM), network
> (topology, what switches, inter-rack/switch links), etc.
> The reason for this will become obvious below.
>
>> I did a simple fio test from one of my xl instances, and got the
>> results below.
>> The latency of 7.21ms is worrying; are these expected
>> results? Or is there any way I can further tune my cluster to achieve
>> better results? thx will
>>
>
>> FIO: sync=1, direct=1, bs=4k
>>
> Full command line, please.
>
> Small, sync I/Os are by far the hardest thing for Ceph.
>
> I can guess what some of the rest was, but it's better to know for sure.
> Alternatively, additionally, try this please:
>
> "fio --size=1G --ioengine=libaio --invalidate=1 --direct=1 --numjobs=1
> --rw=randwrite --name=fiojob --blocksize=4K --iodepth=32"
>
>> write-50: (groupid=11, jobs=50): err= 0: pid=3945: Sun Oct 16 08:41:15 2016
>>   write: io=832092KB, bw=27721KB/s, iops=6930, runt= 30017msec
>>     clat (msec): min=2, max=253, avg= 7.21, stdev= 4.97
>>      lat (msec): min=2, max=253, avg= 7.21, stdev= 4.97
>
> These numbers suggest you did randwrite and aren't all that surprising.
> If you were to run atop on your OSD nodes while doing that fio run, you'll
> likely see that both CPUs and individual disks (HDDs) get very busy.
>
> There are several things conspiring against Ceph here: the latency of its
> own code, the network latency of getting all the individual writes to each
> replica, the fact that 1000 of these 4K blocks will hit one typical RBD
> object (4MB) and thus one PG, making 3 OSDs very busy, etc.
>
> If you absolutely need low latencies with Ceph, consider dedicated SSD-only
> pools for special-need applications (DB), or a cache tier if it fits
> the profile and active working set.
> Lower Ceph latency in general by having fast CPUs which have
> powersaving (frequency throttling) disabled or set to "performance"
> instead of "ondemand".
>
> Christian
>
>> clat percentiles (msec):
>>  |  1.00th=[    4],  5.00th=[    4], 10.00th=[    5], 20.00th=[    5],
>>  | 30.00th=[    5], 40.00th=[    6], 50.00th=[    7], 60.00th=[    8],
>>  | 70.00th=[    9], 80.00th=[   10], 90.00th=[   12], 95.00th=[   14],
>>  | 99.00th=[   17], 99.50th=[   19], 99.90th=[   21], 99.95th=[   23],
>>  | 99.99th=[  253]
>>  bw (KB /s): min=  341, max=  870, per=2.01%, avg=556.60, stdev=136.98
>>  lat (msec) : 4=8.24%, 10=74.10%, 20=17.52%, 50=0.12%, 500=0.02%
>>  cpu          : usr=0.04%, sys=0.23%, ctx=425242, majf=0, minf=1570
>>  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
>>     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>>     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>>     issued    : total=r=0/w=208023/d=0, short=r=0/w=0/d=0
>>
>> On Sun, Oct 16, 2016 at 4:18 PM, Christian Balzer <chibi@xxxxxxx> wrote:
>> >
>> > Hello,
>> >
>> > On Sun, 16 Oct 2016 15:03:24 +0800 William Josefsson wrote:
>> >
>> >> Hi list, while I know that writes in the RADOS backend are sync(), can
>> >> anyone please explain when the cluster will return on a write call for
>> >> RBD from VMs? Will data be considered synced once written to the
>> >> journal, or only all the way to the OSD drive?
>> >>
>> > This has been answered countless times (really) here; the Ceph Architecture
>> > documentation should really be more detailed about this, as well as how
>> > parallel the data is being sent to the secondary OSDs.
>> >
>> > It is of course ack'ed to the client once all journals have successfully
>> > written the data, otherwise journal SSDs would make a LOT less sense.
>> >
>> >> Each host in my cluster has 5x Intel S3610, and 18x 1.8TB Hitachi 10krpm SAS.
>> >>
>> > The size of your SSDs (you didn't mention) will determine the speed; for
>> > journal purposes the sequential write speed is basically it.
>> >
>> > A 5:18 ratio implies that some of your SSDs hold more journals than others.
>> >
>> > You emphatically do NOT want that, because eventually the busier ones will
>> > run out of endurance while the other ones still have plenty left.
>> >
>> > If possible change this to a 5:20 or 6:18 ratio (depending on your SSDs
>> > and expected write volume).
>> >
>> > Christian
>> >
>> >> I have size=3 for my pool. Will Ceph return once the data is written
>> >> to at least 3 designated journals, or will it in fact wait until the
>> >> data is written to the OSD drives? thx will
>> >> _______________________________________________
>> >> ceph-users mailing list
>> >> ceph-users@xxxxxxxxxxxxxx
>> >> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>> >>
>> >
>> > --
>> > Christian Balzer           Network/Systems Engineer
>> > chibi@xxxxxxx              Global OnLine Japan/Rakuten Communications
>> > http://www.gol.com/
>
> --
> Christian Balzer           Network/Systems Engineer
> chibi@xxxxxxx              Global OnLine Japan/Rakuten Communications
> http://www.gol.com/
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com