Re: Unexpected IOPS Ceph Benchmark Result

Hello,

Firstly, this has been discussed here in many incarnations, which is
likely the reason for the silence; a little research goes a long way.

For starters, do yourself a favor and monitor your Ceph nodes with atop,
or collect and graph everything at a fine resolution (intervals of 5s or
less), to get an idea of what is busy and how busy it is.
This will also show you whether you're actually dealing with the correct
devices when choosing SSD or HDD pools, as well as any caching effects,
see below.
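
For a concrete starting point, something like this on every OSD node
while a benchmark is running will do (a minimal sketch; atop and iostat
come from the atop and sysstat packages, adjust the interval to taste):

  atop 5
  iostat -x 5 > iostat_$(hostname).log

atop shows per-device and per-process busy percentages at a glance, and
the iostat log gives you something to graph afterwards.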

Small I/Os stress the CPU side of things significantly, and this is where
I'd expect you to potentially hit limits.
The fact that 2 parallel tests don't improve the aggregate result also
suggests this.
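
A quick way to check is to watch the OSD daemons' CPU usage during a run,
for example (a sketch that assumes sysstat is installed and the OSDs run
as ceph-osd processes):

  # per-thread CPU usage of all OSD daemons, sampled every 5 seconds
  pidstat -t -u -p $(pgrep -d, ceph-osd) 5

If individual OSD threads sit near 100% of a core while the disks are
mostly idle, you are CPU/latency bound rather than disk bound.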

Network-attached storage in general, and Ceph in particular, will suffer
on single-threaded IOPS due to latency.
Locally attached storage is always going to be significantly faster; you
can't compare the two.
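
As a back-of-the-envelope illustration (the latencies below are typical
ballpark figures, not measurements from your cluster): with one
outstanding I/O,

  single-thread IOPS ~= 1 / per-I/O latency
  local SSD at ~0.05 ms per I/O:  1 / 0.00005 s ~= 20,000 IOPS
  Ceph RBD at  ~1 ms per I/O:     1 / 0.001 s   ~=  1,000 IOPS

Adding more OSDs doesn't change that; only more parallelism (iodepth,
numjobs, more clients) gets you anywhere near the aggregate numbers.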

Use consistent settings for your fio runs, i.e. direct=1 and
ioengine=libaio for all of them.
1k IOPS for a single SAS HDD feels very high; you're likely looking at
caching in the OS (a 20GB test file vs. 32GB RAM) and/or the controller.
The size of your test will _also_ fit into the combined caches of the OSD
nodes, which explains your HDD pool speeds as well, provided that pool was
correctly set up to begin with.
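
As a sketch of what a comparable pair of runs could look like (the job
names and the 200g working set below are just placeholder choices; pick a
size that exceeds the combined RAM of your OSD nodes, 4 x 32GB here, to
defeat caching):

  [global]
  ioengine=libaio
  direct=1
  bs=4k
  iodepth=32
  numjobs=1
  runtime=180
  ramp_time=30
  time_based
  size=200g

  [randread]
  rw=randread

  [randwrite]
  stonewall
  rw=randwrite

That way reads and writes are at least measured under the same conditions,
with the page cache out of the picture.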

You're saying "array" multiple times, this is not how Ceph works.
Reads come from the acting OSD, not in a RAID0 fashion from all 3 OSDs
that hold the respective object.

Speaking of objects, with 4k IOPS, you're writing to the same OSDs 1000
times, again no gain from distribution here.
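
If you want to see what a pool itself can deliver, without the VM,
filesystem and RBD layers in the way, rados bench is handy; for example
against your SSD pool (pool name taken from your mail, --no-cleanup keeps
the objects around for the read test):

  rados bench -p rbd-os 60 write -b 4096 -t 16 --no-cleanup
  rados bench -p rbd-os 60 rand -t 16
  rados -p rbd-os cleanup

Run the same against rbd-data and compare; if the two pools really sit on
different device classes, the difference should be obvious both in the
bench output and in atop on the OSD nodes.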

Hopefully this gets you on the right track.

Christian

On Sun, 21 Apr 2019 13:55:37 +0700 Muhammad Fakhri Abdillah wrote:

> Hey everyone,
> Currently running a 4-node Proxmox cluster with an external Ceph cluster
> (Ceph on CentOS 7). There are 4 Ceph OSD nodes installed, each with the
> following specification:
> - 8 Core intel Xeon processor
> - 32GB RAM
> - 2 x 600GB HDD SAS for CentOS (RAID1 as a System)
> - 9 x 1200GB HDD SAS for Data (RAID0 each, bluestore), with 2 x 480GB SSD
> for block.db & block.wal
> - 3 x 960GB SSD for faster pool (RAID0 each, bluestore without separate
> block.db & block.wal)
> - 10Gb eth network
> 
> So in total we have 36 HDD OSDs and 12 SSD OSDs.
> 
> And here is our network topology:
> 
> https://imgur.com/eAHb18I
> 
> 
> On this cluster, I created 4 pools with 3x replication:
> 1. rbd-data (mounted on Proxmox to store VM block data. This pool is
> placed on the HDD OSDs)
> 2. rbd-os (mounted on Proxmox to store the VM OS block devices, for
> better performance. This pool is placed on the SSD OSDs)
> 3. cephfs-data (using the same devices and ruleset as rbd-data, mounted
> on Proxmox as cephfs-data)
> 4. cephfs-metadata
> 
> Here is our crushmap config (to show that we have already separated the
> SSD disks and HDD disks into different pools and rulesets):
> 
> # begin crush map
> .
> ...
> 
> # buckets
> host z1 {
>         id -3           # do not change unnecessarily
>         id -16 class hdd                # do not change unnecessarily
>         id -22 class ssd                # do not change unnecessarily
>         # weight 10.251
>         alg straw2
>         hash 0  # rjenkins1
>         item osd.0 weight 1.139
>         item osd.1 weight 1.139
>         item osd.2 weight 1.139
>         item osd.3 weight 1.139
>         item osd.4 weight 1.139
>         item osd.5 weight 1.139
>         item osd.6 weight 1.139
>         item osd.7 weight 1.139
>         item osd.8 weight 1.139
> }
> host z2 {
>         id -5           # do not change unnecessarily
>         id -17 class hdd                # do not change unnecessarily
>         id -23 class ssd                # do not change unnecessarily
>         # weight 10.251
>         alg straw2
>         hash 0  # rjenkins1
>         item osd.9 weight 1.139
>         item osd.10 weight 1.139
>         item osd.11 weight 1.139
>         item osd.12 weight 1.139
>         item osd.13 weight 1.139
>         item osd.14 weight 1.139
>         item osd.15 weight 1.139
>         item osd.16 weight 1.139
>         item osd.17 weight 1.139
> }
> host z3 {
>         id -7           # do not change unnecessarily
>         id -18 class hdd                # do not change unnecessarily
>         id -24 class ssd                # do not change unnecessarily
>         # weight 10.251
>         alg straw2
>         hash 0  # rjenkins1
>         item osd.18 weight 1.139
>         item osd.19 weight 1.139
>         item osd.20 weight 1.139
>         item osd.21 weight 1.139
>         item osd.22 weight 1.139
>         item osd.23 weight 1.139
>         item osd.24 weight 1.139
>         item osd.25 weight 1.139
>         item osd.26 weight 1.139
> }
> host s1 {
>         id -9           # do not change unnecessarily
>         id -19 class hdd                # do not change unnecessarily
>         id -25 class ssd                # do not change unnecessarily
>         # weight 10.251
>         alg straw2
>         hash 0  # rjenkins1
>         item osd.27 weight 1.139
>         item osd.28 weight 1.139
>         item osd.29 weight 1.139
>         item osd.30 weight 1.139
>         item osd.31 weight 1.139
>         item osd.32 weight 1.139
>         item osd.33 weight 1.139
>         item osd.34 weight 1.139
>         item osd.35 weight 1.139
> }
> root sas {
>         id -1           # do not change unnecessarily
>         id -21 class hdd                # do not change unnecessarily
>         id -26 class ssd                # do not change unnecessarily
>         # weight 51.496
>         alg straw2
>         hash 0  # rjenkins1
>         item z1 weight 12.874
>         item z2 weight 12.874
>         item z3 weight 12.874
>         item s1 weight 12.874
> }
> host z1-ssd {
>         id -101         # do not change unnecessarily
>         id -2 class hdd         # do not change unnecessarily
>         id -11 class ssd                # do not change unnecessarily
>         # weight 2.619
>         alg straw2
>         hash 0  # rjenkins1
>         item osd.36 weight 0.873
>         item osd.37 weight 0.873
>         item osd.38 weight 0.873
> }
> host z2-ssd {
>         id -104         # do not change unnecessarily
>         id -4 class hdd         # do not change unnecessarily
>         id -12 class ssd                # do not change unnecessarily
>         # weight 2.619
>         alg straw2
>         hash 0  # rjenkins1
>         item osd.39 weight 0.873
>         item osd.40 weight 0.873
>         item osd.41 weight 0.873
> }
> host z3-ssd {
>         id -107         # do not change unnecessarily
>         id -6 class hdd         # do not change unnecessarily
>         id -13 class ssd                # do not change unnecessarily
>         # weight 2.619
>         alg straw2
>         hash 0  # rjenkins1
>         item osd.42 weight 0.873
>         item osd.43 weight 0.873
>         item osd.44 weight 0.873
> }
> host s1-ssd {
>         id -110         # do not change unnecessarily
>         id -8 class hdd         # do not change unnecessarily
>         id -14 class ssd                # do not change unnecessarily
>         # weight 2.619
>         alg straw2
>         hash 0  # rjenkins1
>         item osd.45 weight 0.873
>         item osd.46 weight 0.873
>         item osd.47 weight 0.873
> }
> root ssd {
>         id -20          # do not change unnecessarily
>         id -10 class hdd                # do not change unnecessarily
>         id -15 class ssd                # do not change unnecessarily
>         # weight 10.476
>         alg straw2
>         hash 0  # rjenkins1
>         item z1-ssd weight 2.619
>         item z2-ssd weight 2.619
>         item z3-ssd weight 2.619
>         item s1-ssd weight 2.619
> }
> 
> # rules
> rule sas_ruleset {
>         id 0
>         type replicated
>         min_size 1
>         max_size 10
>         step take sas
>         step chooseleaf firstn 0 type host
>         step emit
> }
> rule ssd_ruleset {
>         id 1
>         type replicated
>         min_size 1
>         max_size 10
>         step take ssd
>         step chooseleaf firstn 0 type host
>         step emit
> }
> rule cephfs_ruleset {
>         id 2
>         type replicated
>         min_size 1
>         max_size 10
>         step take sas
>         step chooseleaf firstn 0 type host
>         step emit
> }
> 
> # end crush map
> 
> So far, our testing of the system's functionality has been good, with no
> problems. But we need to prove that our system's performance is good,
> especially from an IOPS perspective.
> 
> Our method to prove it is as follows:
> 
> 1. We benchmark each single disk's performance and take that as the
> baseline.
> 2. We calculate the theoretical maximum IOPS (read & write) of the whole
> array (based on test 1).
> 3. We run the benchmark in 1 VM whose OS sits on an RBD image from a pool
> on the Ceph cluster.
> 
> KPI: we expect the maximum IOPS of the VM test to reach at least 60-70%
> of the maximum IOPS calculated for the whole array.
> 
> To benchmark the cluster, we use fio.
> 
> The fio configs we use:
> 
> READ RUN
> ioengine=libaio
> sync=0
> fsync=1
> direct=1
> runtime=180
> ramp_time=30
> numjobs=1
> filesize=20g
> 
> WRITE RUN
> ioengine=psync
> direct=0
> ramp_time=30
> runtime=180
> numjobs=1
> filesize=20g
> 
> Okay, here we go.
> 
> First we test from a Ceph node itself, directly against 1 single SSD and
> 1 single SAS HDD. We take these results as the baseline performance
> benchmark. Here are the results:
> 
> SSD
> Read IOPS = 50k
> Write IOPS = 20k
> 
> HDD
> Read IOPS = 1k
> Write IOPS = 1k
> 
> From these results, we assume that with 36 HDD OSDs and 12 SSD OSDs we
> should have in total approximately:
> 
> SSD
> Read IOPS = 12 x 50k = 600k
> Write IOPS = 12 x 20k / 3 replication = 80k
> 
> HDD
> Read IOPS = 36 x 1k = 36k
> Write IOPS = 36 x 1k / 3 replication = 12k
> 
> 
> So, we mount 1 RBD image from the SSD pool into 1 VM as its / (root) OS.
> Then we run fio with the exact same config inside the VM, but we only get
> these results:
> 
> SSD
> Read IOPS = 46k
> Write IOPS = 14.4k
> 
> That is basically the IOPS performance of a single SSD, not the
> whole-array figure from our calculation.
> 
> At first we assumed that running 2 simultaneous fio tests on 2 VMs, with
> 2 RBD images from the same pool, would give the same result per VM. In
> theory we could then run up to 12 VMs to reach the maximum cumulative
> performance of the 12 SSD OSDs. But when we ran it, the results were
> simply divided by 2, far from my assumption.
> 
> VM 1
> SSD
> Read IOPS = 23k
> Write IOPS = 7k
> 
> VM 2
> SSD
> Read IOPS = 23k
> Write IOPS = 7k
> 
> These results mean that the first fio test really was the maximum
> performance. So my system's performance is only 8-10% of what it should
> be, while we need to reach at least 60-70% as a KPI.
> 
> For the second try, I changed the VM's OS to an RBD image from the HDD
> pool. When we ran it, weirdly enough I got the exact same result as when
> we tested with the RBD image from the SSD pool:
> 
> HDD
> Read IOPS = 46k
> Write IOPS = 14.4k
> 
> I'm a little bit confused now. I expected to get different results when
> using images from different pools, but that is not the case; it looks
> like one and the same performance, even though we are really sure that we
> have already separated the SSD and HDD pools and crush rules.
> 
> My questions are:
> 
> 1. Why do I get the same test results even though I tested with 2
> different RBD images from 2 different pools and rulesets (SSD and HDD)?
> 2. About the results, which performance exactly am I getting? If it is
> really the SSD performance, why is it so poor? And why does testing the
> HDD pool also show this SSD performance?
> 3. Vice versa, if it shows the HDD performance, which would be
> theoretically correct for the HDD pool, why does testing from the SSD
> pool also show this HDD pool performance?
> 4. Where did I go wrong? Is my understanding of the concept wrong, or is
> the concept correct and I just need to change something in my system
> configuration?
> 
> If you need any other data from my system to help with the analysis, I
> will provide it ASAP. Thank you all :)


-- 
Christian Balzer        Network/Systems Engineer                
chibi@xxxxxxx   	Rakuten Communications
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


