Re: RBD with SSD journals and SAS OSDs

Thx Christian for elaborating on this, appreciate it. I will rerun some
of my benchmarks and take your advice into consideration. I have also
found maximum-performance recommendations for the Dell 730xd BIOS
settings, hope these make sense: http://pasteboard.co/guHVMQVly.jpg
I will apply all of these plus intel_idle.max_cstate=0 as suggested by
Nick (roughly as sketched below) and rerun the fio benchmarks. thx will
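
This is roughly what I plan to do on each host, in case anyone spots a
problem (CentOS7 paths; the commands are just my notes, not verified yet):

# append to GRUB_CMDLINE_LINUX in /etc/default/grub, then rebuild and reboot
#   intel_idle.max_cstate=0
grub2-mkconfig -o /boot/grub2/grub.cfg

# pin the frequency governor to performance on every core
for g in /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor; do
    echo performance > "$g"
done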


On Tue, Oct 18, 2016 at 9:44 AM, Christian Balzer <chibi@xxxxxxx> wrote:
>
> Hello,
>
> As I had this written mostly already and since it covers some points Nick
> raised in more detail, here we go.
>
> On Mon, 17 Oct 2016 16:30:48 +0800 William Josefsson wrote:
>
>> Thx Christian for helping troubleshoot the latency issues. I have
>> attached my fio job template below.
>>
> There's no trouble here per se, just facts of life (Ceph).
>
> You'll be well advised to search the ML archives, especially for what
> Nick Fisk has written about these things (several times).
>
>> To rule out the VM itself being the bottleneck, I've created a 128GB,
>> 32 vCPU flavor.
> Nope, the client is not the issue.
>
>> Here's the latest fio benchmark:
>> http://pastebin.ca/raw/3729693   I'm trying to benchmark the cluster's
>> performance for SYNCED WRITEs and how well suited it would be for
>> disk-intensive workloads or DBs.
>>
>
> A single I/O of that type and size will only hit the journal and be
> ACK'ed quickly (well, quicker than what you see now), but FIO is creating
> a constant stream of requests, eventually hitting the actual OSDs as well.
>
> Aside from CPU load, of course.
>
>>
>> > Only a small fraction of the size (45GB) of these journals is ever
>> > going to be used, unlikely to be more than 1GB in normal operation and
>> > with default filestore/journal parameters.
>>
>> To consume more of the SSDs in the hope of achieving lower latency, can
>> you pls advise which parameters I should be looking at?
>
> Not going to help with your prolonged FIO runs; once the flushing to the
> OSDs commences, stalls will ensue.
> The moment the journal is full or the timers kick in, things will go down
> to OSD (HDD) speed.
> The journal is there to help with small, short bursts.
>
>> I have already
>> tried what's mentioned in RaySun's Ceph blog, which eventually
>> lowered my overall sync write IOPS by 1-2k.
>>
> Unsurprisingly, the default values are there for a reason.
>
>> # These are from RaySun's write-up, and worsen my total IOPS.
>> # http://xiaoquqi.github.io/blog/2015/06/28/ceph-performance-optimization-summary/
>>
>> filestore xattr use omap = true
>> filestore min sync interval = 10
> Way too high, 0.5 is probably already excessive, I run with 0.1.
>
>> filestore max sync interval = 15
>
>> filestore queue max ops = 25000
>> filestore queue max bytes = 10485760
>> filestore queue committing max ops = 5000
>> filestore queue committing max bytes = 10485760000
> Your HDDs will choke on those four. With a 10k SAS HDD a small increase
> of the defaults may help (example below the list).
>
>> journal max write bytes = 1073714824
>> journal max write entries = 10000
>> journal queue max ops = 50000
>> journal queue max bytes = 10485760000
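> To illustrate, a starting point I would experiment with instead (purely
> illustrative numbers, check the defaults of your release before copying
> anything):
> ---
> filestore min sync interval = 0.1
> filestore max sync interval = 5
> filestore queue max ops = 100
> filestore queue max bytes = 209715200
> ---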
>>
>> My Journals are Intel s3610 200GB, split in 4-5 partitions each.
> Again, you want to even that out.
>
>> When
>> I ran fio on the disks locally with direct=1 and sync=1, the write
>> performance was 50k IOPS with 7 threads.
>>
> Yes, but as I wrote, that's not how journals work; think more of 7
> sequential write streams, not random writes.
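> Something closer to the journal pattern would be, for example (DESTRUCTIVE
> if pointed at a raw device; /dev/sdX is a placeholder):
> ---
> fio --filename=/dev/sdX --direct=1 --sync=1 --rw=write --bs=4k \
>     --numjobs=7 --iodepth=1 --runtime=60 --group_reporting --name=journal-test
> ---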
>
> And as I tried to explain before, the SSDs are not the bottleneck, your
> CPUs may be and your OSD HDDs eventually will be.
> Run atop on all your nodes when doing those tests and see how much things
> get pushed (CPUs, disks, the OSD processes).
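> For example, on one of the OSD nodes during a run (osd.0 is just an
> example ID):
> ---
> atop 2
> ceph daemon osd.0 dump_historic_ops
> ---
> The latter shows where the slowest recent ops spent their time.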
>
>> My hardware specs:
>>
>> - 3 Controllers, The mons run here
>> Dell PE R630, 64GB, Intel SSD s3610
>> - 9 Storage nodes
>> Dell 730xd, 2x2630v4 2.2Ghz, 512GB, Journal: 5x200GB Intel 3610 SSD,
>> OSD: 18x1.8TB Hitachi 10krpm SAS
>>
> I can't really fault you for the choice of CPUs, but smaller nodes with
> higher clock speed and fewer cores may help with this extreme test case
> (in normal production you're fine).
>
>> RAID Controller is PERC 730
>>
>> All servers have 2x10GbE bonds, Intel ixgbe X540 copper connecting to
>> Arista 7050X 10Gbit switches with VARP and LACP interfaces. I have
>> pinged all hosts from my VM and the RTT is 0.3ms on the LAN. I did
>> iperf, and I can do 10Gbps from the VM to the storage nodes.
>>
> Bandwidth is irrelevant in this case; the RTT of 0.3ms feels a bit high.
> If you look again at the flow in
> http://docs.ceph.com/docs/hammer/architecture/#smart-daemons-enable-hyperscale
>
> those round trips will add up to a significant part of your Ceph latency.
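> A quick way to get a more meaningful number than a single ping (host name
> is a placeholder, run it both from the client and between OSD nodes):
> ---
> ping -q -c 200 -i 0.2 storage-node-1
> ---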
>
> To elaborate and demonstrate:
>
> I have a test cluster consisting of 4 nodes, 2 of them HDD-backed OSDs
> with SSD journals and 2 of them SSD-based (4x DC S3610 400GB each) acting
> as a cache tier for the "normal" ones. All replication 2.
> So for the purpose of this test, everything goes 100% against the SSDs in
> the cache pool only.
>
> The network is IPoIB (QDDR, 40Gb/s Infiniband) with 0.1ms latency between
> nodes, CPU is a single E5-2620 v3.
>
> If I run this from a VM:
> ---
> fio --size=1G --ioengine=libaio --invalidate=1 --direct=1 --numjobs=1 --rw=randwrite --name=fiojob --blocksize=4K --iodepth=64
> ---
>
> We wind up with:
> ---
>   write: io=1024.0MB, bw=34172KB/s, iops=8543, runt= 30685msec
>     slat (usec): min=1, max=2874, avg= 4.66, stdev= 7.07
>     clat (msec): min=1, max=66, avg= 7.49, stdev= 7.80
>      lat (msec): min=1, max=66, avg= 7.49, stdev= 7.80
> ---
> During this run the CPU is the bottleneck, idle is around 60% (of 1200),
> all 4 OSD processes eat up nearly 3 CPU "cores".
> As I said, small random IOPS is the most stressful thing for Ceph.
> CPU performance settings influence this little/not at all, as everything
> goes to full speed in less than a second and stays there.
>
>
> If we change the FIO invocation to plain sequential "--rw=write", the CPU
> usage is less than 250% (out of 1200) and things are pretty relaxed.
> At that point we're basically pushing the edge of latency in all
> components involved:
> ---
>   write: io=1024.0MB, bw=37819KB/s, iops=9454, runt= 27726msec
>     slat (usec): min=1, max=3834, avg= 3.77, stdev= 8.42
>     clat (usec): min=943, max=38129, avg=6764.11, stdev=3262.91
>      lat (usec): min=954, max=38135, avg=6768.04, stdev=3263.55
> ---
>
> If we then lower this to a single in-flight request with "--iodepth=1",
> to see how fast things could potentially be when we don't saturate
> everything:
> ---
>     slat (usec): min=12, max=100, avg=21.43, stdev= 7.96
>     clat (usec): min=1725, max=5873, avg=2485.46, stdev=256.97
>      lat (usec): min=1744, max=5894, avg=2507.35, stdev=257.11
> ---
>
> So 2.5ms instead of 7ms. Not too shabby.
>
>
> Now if we do the same run but with CPU governors set to performance we get:
> ---
>     slat (usec): min=6, max=291, avg=17.34, stdev= 8.00
>     clat (usec): min=957, max=13754, avg=1425.83, stdev=262.85
>      lat (usec): min=968, max=13766, avg=1443.56, stdev=264.54
> ---
>
> So that's where the CPU tuning comes in.
> And this is, in real life where you hopefully don't have thousands of
> small sync I/Os at the same time, a pretty decent result.
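> (A quick check that the governor change actually took effect on every
> core, for example:
> ---
> cat /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor | sort | uniq -c
> ---
> should list nothing but "performance".)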
>
>
>> I've already set the CPU scaling governor to 'performance' on all
>> hosts for all cores. My Ceph release is the latest Hammer on CentOS7.
>>
> Jewel is also supposed to have many improvements in this area, but frankly
> I haven't been brave (convinced) enough to upgrade from Hammer yet.
>
> Christian
>
>> The best write result currently happens at 62 threads it seems; the
>> IOPS is 8.3k for the direct synced writes. The latency and stddev are
>> still concerning.. :(
>>
>> simple-write-62: (groupid=14, jobs=62): err= 0: pid=2748: Mon Oct 17
>> 15:20:05 2016
>>   write: io=978.64MB, bw=33397KB/s, iops=8349, runt= 30006msec
>>     clat (msec): min=3, max=20, avg= 7.42, stdev= 2.50
>>      lat (msec): min=3, max=20, avg= 7.42, stdev= 2.50
>>     clat percentiles (usec):
>>      |  1.00th=[ 3888],  5.00th=[ 4256], 10.00th=[ 4448], 20.00th=[ 4768],
>>      | 30.00th=[ 5088], 40.00th=[ 5984], 50.00th=[ 7904], 60.00th=[ 8384],
>>      | 70.00th=[ 8768], 80.00th=[ 9408], 90.00th=[10432], 95.00th=[11584],
>>      | 99.00th=[13760], 99.50th=[14784], 99.90th=[16320], 99.95th=[16512],
>>      | 99.99th=[17792]
>>     bw (KB  /s): min=  315, max=  761, per=1.61%, avg=537.06, stdev=77.13
>>     lat (msec) : 4=1.99%, 10=84.54%, 20=13.47%, 50=0.01%
>>   cpu          : usr=0.05%, sys=0.35%, ctx=509542, majf=0, minf=1902
>>   IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
>>      submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>>      complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>>      issued    : total=r=0/w=250527/d=0, short=r=0/w=0/d=0
>>
>>
>> From the above we can tell that the latency for clients doing synced
>> writes is somewhere around 5-10ms, which seems very high, especially
>> with quite high-performing hardware, network, and SSD journals. I'm not
>> sure whether it may be the syncing from journal to OSD that causes
>> these fluctuations or high latencies.
>>
>> Any help or advice would be much appreciated. thx will
>>
>>
>> [global]
>> bs=4k
>> rw=write
>> sync=1
>> direct=1
>> iodepth=1
>> filename=${FILE}
>> runtime=30
>> stonewall=1
>> group_reporting
>>
>> [simple-write-6]
>> numjobs=6
>> [simple-write-10]
>> numjobs=10
>> [simple-write-14]
>> numjobs=14
>> [simple-write-18]
>> numjobs=18
>> [simple-write-22]
>> numjobs=22
>> [simple-write-26]
>> numjobs=26
>> [simple-write-30]
>> numjobs=30
>> [simple-write-34]
>> numjobs=34
>> [simple-write-38]
>> numjobs=38
>> [simple-write-42]
>> numjobs=42
>> [simple-write-46]
>> numjobs=46
>> [simple-write-50]
>> numjobs=50
>> [simple-write-54]
>> numjobs=54
>> [simple-write-58]
>> numjobs=58
>> [simple-write-62]
>> numjobs=62
>> [simple-write-66]
>> numjobs=66
>> [simple-write-70]
>> numjobs=70
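>>
>> For completeness, I invoke the job file roughly like this (file name and
>> test path are just what I happen to use; fio picks up ${FILE} from the
>> environment):
>>
>> FILE=/mnt/test/fio.dat fio sync-write.fio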
>>
>> On Mon, Oct 17, 2016 at 10:47 AM, Christian Balzer <chibi@xxxxxxx> wrote:
>> >
>> > Hello,
>> >
>> >
>> > On Sun, 16 Oct 2016 19:07:17 +0800 William Josefsson wrote:
>> >
>> >> Ok thanks for sharing. Yes, my journals are Intel S3610 200GB, which
>> >> I split into 4 partitions of ~45GB each. When I run ceph-deploy I
>> >> declare these as the journals of the OSDs.
>> >>
>> > Only a small fraction of the size (45GB) of these journals is ever
>> > going to be used, unlikely to be more than 1GB in normal operation and
>> > with default filestore/journal parameters.
>> >
>> > Because those defaults start flushing things (from RAM, the journal never
>> > gets read unless there is a crash) to the filestore (OSD HDD) pretty much
>> > immediately.
>> >
>> > Again, use google to search the ML archives.
>> >
>> >> I was trying to understand the blocking, and how much my SAS OSDs
>> >> affect my performance. I have a total of 9 hosts and 158 OSDs, each
>> >> 1.8TB. The servers are connected through copper 10Gbit LACP bonds.
>> >> My failure domain is by type RACK. The CRUSH rule set is by rack, 3
>> >> hosts in each rack. Pool size is 3. I'm running Hammer on CentOS7.
>> >>
>> >
>> > Which raises the need to fully detail your HW (CPUs, RAM), network
>> > (topology, what switches, inter-rack/switch links), etc.
>> > The reason for this will become obvious below.
>> >
>> >> I did a simple fio test from one of my xl instances, and got the
>> >> results below. The latency of 7.21ms is worrying, is this an expected
>> >> result? Or is there any way I can further tune my cluster to achieve
>> >> better results? thx will
>> >>
>> >
>> >> FIO: sync=1, direct=1, bs=4k
>> >>
>> > Full command line, please.
>> >
>> > Small, sync I/Os are by far the hardest thing for Ceph.
>> >
>> > I can guess what some of the rest was, but it's better to know for sure.
>> > Alternatively, additionally, try this please:
>> >
>> > "fio --size=1G --ioengine=libaio --invalidate=1  --direct=1 --numjobs=1
>> > --rw=randwrite --name=fiojob --blocksize=4K --iodepth=32"
>> >
>> >>
>> >> write-50: (groupid=11, jobs=50): err= 0: pid=3945: Sun Oct 16 08:41:15 2016
>> >>   write: io=832092KB, bw=27721KB/s, iops=6930, runt= 30017msec
>> >>     clat (msec): min=2, max=253, avg= 7.21, stdev= 4.97
>> >>      lat (msec): min=2, max=253, avg= 7.21, stdev= 4.97
>> >
>> > These numbers suggest you did randwrite and aren't all that surprising.
>> > If you were to run atop on your OSD nodes while doing that fio run,
>> > you'd likely see that both CPUs and individual disks (HDDs) get very busy.
>> >
>> > There are several things conspiring against Ceph here: the latency of
>> > its own code, the network latency of getting all the individual writes
>> > to each replica, the fact that 1000 of these 4K blocks will hit one
>> > typical RBD object (4MB) and thus one PG, making 3 OSDs very busy, etc.
>> >
>> > If you absolutely need low latencies with Ceph, consider dedicated
>> > SSD-only pools for special-needs applications (DBs), or a cache tier if
>> > it fits the profile and active working set.
>> > Lower Ceph latency in general by having fast CPUs which have
>> > powersaving (frequency throttling) disabled or set to "performance"
>> > instead of "ondemand".
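>> > The rough shape of a cache tier setup, for reference (hammer syntax,
>> > pool names are placeholders, hit_set and sizing parameters left out):
>> >
>> > ceph osd tier add rbd rbd-cache
>> > ceph osd tier cache-mode rbd-cache writeback
>> > ceph osd tier set-overlay rbd rbd-cache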
>> >
>> > Christian
>> >
>> >>     clat percentiles (msec):
>> >>      |  1.00th=[    4],  5.00th=[    4], 10.00th=[    5], 20.00th=[    5],
>> >>      | 30.00th=[    5], 40.00th=[    6], 50.00th=[    7], 60.00th=[    8],
>> >>      | 70.00th=[    9], 80.00th=[   10], 90.00th=[   12], 95.00th=[   14],
>> >>      | 99.00th=[   17], 99.50th=[   19], 99.90th=[   21], 99.95th=[   23],
>> >>      | 99.99th=[  253]
>> >>     bw (KB  /s): min=  341, max=  870, per=2.01%, avg=556.60, stdev=136.98
>> >>     lat (msec) : 4=8.24%, 10=74.10%, 20=17.52%, 50=0.12%, 500=0.02%
>> >>   cpu          : usr=0.04%, sys=0.23%, ctx=425242, majf=0, minf=1570
>> >>   IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
>> >>      submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>> >>      complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>> >>      issued    : total=r=0/w=208023/d=0, short=r=0/w=0/d=0
>> >>
>> >> On Sun, Oct 16, 2016 at 4:18 PM, Christian Balzer <chibi@xxxxxxx> wrote:
>> >> >
>> >> > Hello,
>> >> >
>> >> > On Sun, 16 Oct 2016 15:03:24 +0800 William Josefsson wrote:
>> >> >
>> >> >> Hi list, while I know that writes in the RADOS backend are
>> >> >> sync()'ed, can anyone please explain when the cluster will return
>> >> >> on a write call for RBD from VMs? Will data be considered synced
>> >> >> once written to the journal, or only when written all the way to
>> >> >> the OSD drive?
>> >> >>
>> >> > This has been answered countless times (really) here; the Ceph
>> >> > architecture documentation should really be more detailed about this,
>> >> > as well as about how the data is sent to the secondary OSDs in
>> >> > parallel.
>> >> >
>> >> > It is of course ack'ed to the client once all journals have successfully
>> >> > written the data, otherwise journal SSDs would make a LOT less sense.
>> >> >
>> >> >> Each host in my cluster has 5x Intel S3610, and 18x1.8TB Hitachi 10krpm SAS.
>> >> >>
>> >> > The size of your SSDs (you didn't mention it) will determine the
>> >> > speed; for journal purposes the sequential write speed is basically
>> >> > all that matters.
>> >> >
>> >> > A 5:18 ratio implies that some of your SSDs hold more journals than others.
>> >> >
>> >> > You emphatically do NOT want that, because eventually the busier ones will
>> >> > run out of endurance while the other ones still have plenty left.
>> >> >
>> >> > If possible change this to a 5:20 or 6:18 ratio (depending on your SSDs
>> >> > and expected write volume).
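>> >> > (With 5 SSDs and 18 OSDs the best you can do is a 4+4+4+3+3 split of
>> >> > journals per SSD; 6:18 gives an even 3 per SSD and 5:20 an even 4.)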
>> >> >
>> >> > Christian
>> >> >> I have size=3 for my pool. Will Ceph return once the data is written
>> >> >> to at least 3 designated journals, or will it in fact wait until the
>> >> >> data is written to the OSD drives? thx will
>> >> >> _______________________________________________
>> >> >> ceph-users mailing list
>> >> >> ceph-users@xxxxxxxxxxxxxx
>> >> >> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>> >> >>
>> >> >
>> >> >
>> >> > --
>> >> > Christian Balzer        Network/Systems Engineer
>> >> > chibi@xxxxxxx           Global OnLine Japan/Rakuten Communications
>> >> > http://www.gol.com/
>> >>
>> >
>> >
>> > --
>> > Christian Balzer        Network/Systems Engineer
>> > chibi@xxxxxxx           Global OnLine Japan/Rakuten Communications
>> > http://www.gol.com/
>>
>
>
> --
> Christian Balzer        Network/Systems Engineer
> chibi@xxxxxxx           Global OnLine Japan/Rakuten Communications
> http://www.gol.com/
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


