Hey everyone,
We are currently running a 4-node Proxmox cluster with an external Ceph cluster (Ceph on CentOS 7). The Ceph side has 4 OSD nodes, each with the following specification:
- 8-core Intel Xeon processor
- 32GB RAM
- 2 x 600GB SAS HDD for CentOS (RAID1, system disks)
- 9 x 1200GB SAS HDD for data (each as a single-disk RAID0, BlueStore), with 2 x 480GB SSD for block.db & block.wal
- 3 x 960GB SSD for a faster pool (each as a single-disk RAID0, BlueStore, without separate block.db & block.wal)
- 10GbE network
In total we have 36 HDD OSDs and 12 SSD OSDs.
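For reference, each BlueStore HDD OSD with its DB/WAL on SSD would have been created with something along the lines below (the device paths are only illustrative, not our real ones):
# illustrative only -- one HDD OSD with block.db (and WAL) on an SSD partition
ceph-volume lvm create --bluestore --data /dev/sdc --block.db /dev/sdb1
# illustrative only -- one SSD OSD without a separate block.db/block.wal
ceph-volume lvm create --bluestore --data /dev/sdm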
Our network topology: [diagram omitted]
On this cluster I created 4 pools, each with 3x replication:
1. rbd-data (attached in Proxmox to store VM data block devices; this pool is placed on the HDD OSDs)
2. rbd-os (attached in Proxmox to store VM OS disks for better performance; this pool is placed on the SSD OSDs)
3. cephfs-data (same devices and ruleset as rbd-data, mounted in Proxmox as CephFS)
4. cephfs-metadata
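For reference, the RBD pools are tied to the rulesets from the crushmap below roughly like this (the PG counts here are placeholders, not our actual values; the CephFS pools are set up the same way):
ceph osd pool create rbd-data 1024 1024 replicated sas_ruleset
ceph osd pool create rbd-os 512 512 replicated ssd_ruleset
ceph osd pool set rbd-data size 3
ceph osd pool set rbd-os size 3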
Here is our crushmap (to show that we have already separated the SSD and HDD disks into different roots and rulesets):
# begin crush map
...
# buckets
host z1 {
id -3 # do not change unnecessarily
id -16 class hdd # do not change unnecessarily
id -22 class ssd # do not change unnecessarily
# weight 10.251
alg straw2
hash 0 # rjenkins1
item osd.0 weight 1.139
item osd.1 weight 1.139
item osd.2 weight 1.139
item osd.3 weight 1.139
item osd.4 weight 1.139
item osd.5 weight 1.139
item osd.6 weight 1.139
item osd.7 weight 1.139
item osd.8 weight 1.139
}
host z2 {
id -5 # do not change unnecessarily
id -17 class hdd # do not change unnecessarily
id -23 class ssd # do not change unnecessarily
# weight 10.251
alg straw2
hash 0 # rjenkins1
item osd.9 weight 1.139
item osd.10 weight 1.139
item osd.11 weight 1.139
item osd.12 weight 1.139
item osd.13 weight 1.139
item osd.14 weight 1.139
item osd.15 weight 1.139
item osd.16 weight 1.139
item osd.17 weight 1.139
}
host z3 {
id -7 # do not change unnecessarily
id -18 class hdd # do not change unnecessarily
id -24 class ssd # do not change unnecessarily
# weight 10.251
alg straw2
hash 0 # rjenkins1
item osd.18 weight 1.139
item osd.19 weight 1.139
item osd.20 weight 1.139
item osd.21 weight 1.139
item osd.22 weight 1.139
item osd.23 weight 1.139
item osd.24 weight 1.139
item osd.25 weight 1.139
item osd.26 weight 1.139
}
host s1 {
id -9 # do not change unnecessarily
id -19 class hdd # do not change unnecessarily
id -25 class ssd # do not change unnecessarily
# weight 10.251
alg straw2
hash 0 # rjenkins1
item osd.27 weight 1.139
item osd.28 weight 1.139
item osd.29 weight 1.139
item osd.30 weight 1.139
item osd.31 weight 1.139
item osd.32 weight 1.139
item osd.33 weight 1.139
item osd.34 weight 1.139
item osd.35 weight 1.139
}
root sas {
id -1 # do not change unnecessarily
id -21 class hdd # do not change unnecessarily
id -26 class ssd # do not change unnecessarily
# weight 51.496
alg straw2
hash 0 # rjenkins1
item z1 weight 12.874
item z2 weight 12.874
item z3 weight 12.874
item s1 weight 12.874
}
host z1-ssd {
id -101 # do not change unnecessarily
id -2 class hdd # do not change unnecessarily
id -11 class ssd # do not change unnecessarily
# weight 2.619
alg straw2
hash 0 # rjenkins1
item osd.36 weight 0.873
item osd.37 weight 0.873
item osd.38 weight 0.873
}
host z2-ssd {
id -104 # do not change unnecessarily
id -4 class hdd # do not change unnecessarily
id -12 class ssd # do not change unnecessarily
# weight 2.619
alg straw2
hash 0 # rjenkins1
item osd.39 weight 0.873
item osd.40 weight 0.873
item osd.41 weight 0.873
}
host z3-ssd {
id -107 # do not change unnecessarily
id -6 class hdd # do not change unnecessarily
id -13 class ssd # do not change unnecessarily
# weight 2.619
alg straw2
hash 0 # rjenkins1
item osd.42 weight 0.873
item osd.43 weight 0.873
item osd.44 weight 0.873
}
host s1-ssd {
id -110 # do not change unnecessarily
id -8 class hdd # do not change unnecessarily
id -14 class ssd # do not change unnecessarily
# weight 2.619
alg straw2
hash 0 # rjenkins1
item osd.45 weight 0.873
item osd.46 weight 0.873
item osd.47 weight 0.873
}
root ssd {
id -20 # do not change unnecessarily
id -10 class hdd # do not change unnecessarily
id -15 class ssd # do not change unnecessarily
# weight 10.476
alg straw2
hash 0 # rjenkins1
item z1-ssd weight 2.619
item z2-ssd weight 2.619
item z3-ssd weight 2.619
item s1-ssd weight 2.619
}
# rules
rule sas_ruleset {
id 0
type replicated
min_size 1
max_size 10
step take sas
step chooseleaf firstn 0 type host
step emit
}
rule ssd_ruleset {
id 1
type replicated
min_size 1
max_size 10
step take ssd
step chooseleaf firstn 0 type host
step emit
}
rule cephfs_ruleset {
id 2
type replicated
min_size 1
max_size 10
step take sas
step chooseleaf firstn 0 type host
step emit
}
# end crush map
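To confirm that each pool really maps to the intended ruleset and root, checks roughly like these can be run (pool and rule names as above):
ceph osd pool get rbd-data crush_rule   # expect: crush_rule: sas_ruleset
ceph osd pool get rbd-os crush_rule     # expect: crush_rule: ssd_ruleset
ceph osd crush rule dump ssd_ruleset    # shows the "step take ssd" in the rule
ceph osd df tree                        # shows per-OSD usage under each root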
So far, functional testing of the system has gone well with no problems. But we need to prove that its performance is also good, especially from an IOPS perspective.
Our method to prove it is as follows:
1. We benchmark a single disk of each type and use that as the baseline performance.
2. We calculate the theoretical maximum IOPS (read & write) of the whole array, based on test 1.
3. We benchmark inside 1 VM whose OS disk is an RBD image from a pool on the Ceph cluster.
KPI: we expect the maximum IOPS measured in the VM test to reach at least 60-70% of the calculated whole-array maximum.
To benchmark the cluster we use fio.
The fio configs we use:
READ RUN
ioengine=libaio
sync=0
fsync=1
direct=1
runtime=180
ramp_time=30
numjobs=1
filesize=20g
WRITE RUN
ioengine=psync
direct=0
ramp_time=30
runtime=180
numjobs=1
filesize=20g
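The configs above omit the I/O pattern, block size, and queue depth; a complete read-run invocation, with those three values as assumptions rather than our exact settings, looks roughly like this:
fio --name=read-run --ioengine=libaio --direct=1 --sync=0 --fsync=1 \
    --rw=randread --bs=4k --iodepth=32 --numjobs=1 \
    --filesize=20g --runtime=180 --ramp_time=30 --directory=/mnt/fio-test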
Okay, here we go.
First, we test directly from a Ceph node against 1 single SSD and 1 single SAS HDD, and take the result as the baseline benchmark. Here is the result:
SSD
Read IOPS = 50k
Write IOPS = 20k
HDD
Read IOPS = 1k
Write IOPS = 1k
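The baseline above was taken directly against a single disk on one node; an invocation along these lines would do that (the device path and the 4k random pattern are assumptions, and writing to a raw device is of course destructive):
fio --name=ssd-baseline --filename=/dev/sdX --ioengine=libaio --direct=1 \
    --rw=randread --bs=4k --iodepth=32 --numjobs=1 --runtime=180 --ramp_time=30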
From these results, we assume that with 36 HDD OSDs and 12 SSD OSDs we should have, in total, approximately:
SSD
Read IOPS = 12 x 50k = 600k
Write IOPS = 12 x 20k / 3 replication = 80k
HDD
Read IOPS = 36 x 1k = 36k
Write IOPS = 36 x 1k / 3 replication = 12k
So we attach 1 RBD image from the SSD pool to 1 VM as its root (/) OS disk, then run fio inside the VM with the exact same config, but we only get these results:
SSD
Read IOPS = 46k
Write IOPS = 14.4k
That is roughly the IOPS of a single SSD, not the whole-array calculation.
At first we assumed that running 2 simultaneous fio tests on 2 VMs, each with its own RBD image from the same pool, would give the same result per VM; in theory we could then run up to 12 VMs to reach the cumulative maximum of the 12 SSD OSDs. But when we ran it, the result was simply divided by 2, far from my assumption:
VM 1
SSD
Read IOPS = 23k
Write IOPS = 7k
VM 2
SSD
Read IOPS = 23k
Write IOPS = 7k
These results mean that the first fio test really was the maximum performance. So my system is delivering only 8-10% of what it should, while we need to reach at least 60-70% as the KPI.
For the second try, I changed the VM's OS disk to an RBD image from the HDD pool. Strangely enough, when we ran it we got exactly the same result as with the RBD image from the SSD pool:
HDD
Read IOPS = 46k
Write IOPS = 14.4k
I'm a little confused now. I would expect different results when using images from different pools, but that is not the case; it's as if both tests hit the same backend with the same performance. We're really sure that we have already separated the SSD and HDD pools and crush rules.
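For completeness, this is the kind of check we can run to confirm which pool an image really lives in and which OSDs its objects land on (the image and object names here are illustrative):
rbd -p rbd-os ls                       # list images in the SSD pool
rbd info rbd-os/vm-100-disk-1          # reports the pool and the block_name_prefix
ceph osd map rbd-os <object-name>      # shows the PG and acting OSDs for an object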
My questions are:
1. Why do I get the same test results even though I tested with 2 different RBD images from 2 different pools and rulesets (SSD and HDD)?
2. Which performance am I actually measuring? If it is really the SSD performance, why is it so poor? And why does the HDD pool test show the same numbers as the SSD pool?
3. Vice versa, if it is the HDD performance (which would at least be theoretically plausible for the HDD pool), why does the SSD pool test show the same HDD-level numbers?
4. Where did I go wrong? Is my understanding of the concept wrong, or is the concept correct and I just need to change something in my configuration?
If you need any other data about my system to help you analyze this, I will provide it ASAP. Thank you all :)