Hey everyone,
We are currently running a 4-node Proxmox cluster with an external Ceph cluster (Ceph on CentOS 7). The Ceph side has 4 OSD nodes, each with the following specification:
- 8-core Intel Xeon processor
- 32GB RAM
- 2 x 600GB SAS HDD for CentOS (RAID1, system disks)
- 9 x 1200GB SAS HDD for data (each as a single-disk RAID0, BlueStore), with 2 x 480GB SSD for block.db & block.wal
- 3 x 960GB SSD for a faster pool (each as a single-disk RAID0, BlueStore, without separate block.db & block.wal)
- 10GbE network
In total we have 36 HDD OSDs and 12 SSD OSDs.
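For reference, each BlueStore HDD OSD with its DB/WAL on SSD would have been created with something along the lines below (the device paths are only illustrative, not our real ones):
# illustrative only -- one HDD OSD with block.db (and WAL) on an SSD partition
ceph-volume lvm create --bluestore --data /dev/sdc --block.db /dev/sdb1
# illustrative only -- one SSD OSD without a separate block.db/block.wal
ceph-volume lvm create --bluestore --data /dev/sdm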
Our network topology: [diagram omitted]
On this cluster I created 4 pools, each with 3x replication:
1. rbd-data (attached in Proxmox to store VM data block devices; this pool is placed on the HDD OSDs)
2. rbd-os (attached in Proxmox to store VM OS disks for better performance; this pool is placed on the SSD OSDs)
3. cephfs-data (same devices and ruleset as rbd-data, mounted in Proxmox as CephFS)
4. cephfs-metadata
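For reference, the RBD pools are tied to the rulesets from the crushmap below roughly like this (the PG counts here are placeholders, not our actual values; the CephFS pools are set up the same way):
ceph osd pool create rbd-data 1024 1024 replicated sas_ruleset
ceph osd pool create rbd-os 512 512 replicated ssd_ruleset
ceph osd pool set rbd-data size 3
ceph osd pool set rbd-os size 3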
Here is our crushmap (to show that we have already separated the SSD and HDD disks into different roots and rulesets):
# begin crush map
...
# buckets
host z1 {
id -3 # do not change unnecessarily
id -16 class hdd # do not change unnecessarily
id -22 class ssd # do not change unnecessarily
# weight 10.251
alg straw2
hash 0 # rjenkins1
item osd.0 weight 1.139
item osd.1 weight 1.139
item osd.2 weight 1.139
item osd.3 weight 1.139
item osd.4 weight 1.139
item osd.5 weight 1.139
item osd.6 weight 1.139
item osd.7 weight 1.139
item osd.8 weight 1.139
}
host z2 {
id -5 # do not change unnecessarily
id -17 class hdd # do not change unnecessarily
id -23 class ssd # do not change unnecessarily
# weight 10.251
alg straw2
hash 0 # rjenkins1
item osd.9 weight 1.139
item osd.10 weight 1.139
item osd.11 weight 1.139
item osd.12 weight 1.139
item osd.13 weight 1.139
item osd.14 weight 1.139
item osd.15 weight 1.139
item osd.16 weight 1.139
item osd.17 weight 1.139
}
host z3 {
id -7 # do not change unnecessarily
id -18 class hdd # do not change unnecessarily
id -24 class ssd # do not change unnecessarily
# weight 10.251
alg straw2
hash 0 # rjenkins1
item osd.18 weight 1.139
item osd.19 weight 1.139
item osd.20 weight 1.139
item osd.21 weight 1.139
item osd.22 weight 1.139
item osd.23 weight 1.139
item osd.24 weight 1.139
item osd.25 weight 1.139
item osd.26 weight 1.139
}
host s1 {
id -9 # do not change unnecessarily
id -19 class hdd # do not change unnecessarily
id -25 class ssd # do not change unnecessarily
# weight 10.251
alg straw2
hash 0 # rjenkins1
item osd.27 weight 1.139
item osd.28 weight 1.139
item osd.29 weight 1.139
item osd.30 weight 1.139
item osd.31 weight 1.139
item osd.32 weight 1.139
item osd.33 weight 1.139
item osd.34 weight 1.139
item osd.35 weight 1.139
}
root sas {
id -1 # do not change unnecessarily
id -21 class hdd # do not change unnecessarily
id -26 class ssd # do not change unnecessarily
# weight 51.496
alg straw2
hash 0 # rjenkins1
item z1 weight 12.874
item z2 weight 12.874
item z3 weight 12.874
item s1 weight 12.874
}
host z1-ssd {
id -101 # do not change unnecessarily
id -2 class hdd # do not change unnecessarily
id -11 class ssd # do not change unnecessarily
# weight 2.619
alg straw2
hash 0 # rjenkins1
item osd.36 weight 0.873
item osd.37 weight 0.873
item osd.38 weight 0.873
}
host z2-ssd {
id -104 # do not change unnecessarily
id -4 class hdd # do not change unnecessarily
id -12 class ssd # do not change unnecessarily
# weight 2.619
alg straw2
hash 0 # rjenkins1
item osd.39 weight 0.873
item osd.40 weight 0.873
item osd.41 weight 0.873
}
host z3-ssd {
id -107 # do not change unnecessarily
id -6 class hdd # do not change unnecessarily
id -13 class ssd # do not change unnecessarily
# weight 2.619
alg straw2
hash 0 # rjenkins1
item osd.42 weight 0.873
item osd.43 weight 0.873
item osd.44 weight 0.873
}
host s1-ssd {
id -110 # do not change unnecessarily
id -8 class hdd # do not change unnecessarily
id -14 class ssd # do not change unnecessarily
# weight 2.619
alg straw2
hash 0 # rjenkins1
item osd.45 weight 0.873
item osd.46 weight 0.873
item osd.47 weight 0.873
}
root ssd {
id -20 # do not change unnecessarily
id -10 class hdd # do not change unnecessarily
id -15 class ssd # do not change unnecessarily
# weight 10.476
alg straw2
hash 0 # rjenkins1
item z1-ssd weight 2.619
item z2-ssd weight 2.619
item z3-ssd weight 2.619
item s1-ssd weight 2.619
}
# rules
rule sas_ruleset {
id 0
type replicated
min_size 1
max_size 10
step take sas
step chooseleaf firstn 0 type host
step emit
}
rule ssd_ruleset {
id 1
type replicated
min_size 1
max_size 10
step take ssd
step chooseleaf firstn 0 type host
step emit
}
rule cephfs_ruleset {
id 2
type replicated
min_size 1
max_size 10
step take sas
step chooseleaf firstn 0 type host
step emit
}
# end crush map
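To confirm that each pool really maps to the intended ruleset and root, checks roughly like these can be run (pool and rule names as above):
ceph osd pool get rbd-data crush_rule   # expect: crush_rule: sas_ruleset
ceph osd pool get rbd-os crush_rule     # expect: crush_rule: ssd_ruleset
ceph osd crush rule dump ssd_ruleset    # shows the "step take ssd" in the rule
ceph osd df tree                        # shows per-OSD usage under each root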
So far, functional testing of the system has gone well with no problems. But we need to prove that its performance is also good, especially from an IOPS perspective.
Our method to prove it is as follows:
1. We benchmark a single disk of each type and use that as the baseline performance.
2. We calculate the theoretical maximum IOPS (read & write) of the whole array, based on test 1.
3. We benchmark inside 1 VM whose OS disk is an RBD image from a pool on the Ceph cluster.
KPI: we expect the maximum IOPS measured in the VM test to reach at least 60-70% of the calculated whole-array maximum.
To benchmark the cluster we use fio.
The fio configs we use:
READ RUN
ioengine=libaio
sync=0
fsync=1
direct=1
runtime=180
ramp_time=30
numjobs=1
filesize=20g
WRITE RUN
ioengine=psync
direct=0
ramp_time=30
runtime=180
numjobs=1
filesize=20g
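The configs above omit the I/O pattern, block size, and queue depth; a complete read-run invocation, with those three values as assumptions rather than our exact settings, looks roughly like this:
fio --name=read-run --ioengine=libaio --direct=1 --sync=0 --fsync=1 \
    --rw=randread --bs=4k --iodepth=32 --numjobs=1 \
    --filesize=20g --runtime=180 --ramp_time=30 --directory=/mnt/fio-test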
Okay, here we go.
First, we test directly from a Ceph node against 1 single SSD and 1 single SAS HDD, and take the result as the baseline benchmark. Here is the result:
SSD
Read IOPS = 50k
Write IOPS = 20k
HDD
Read IOPS = 1k
Write IOPS = 1k
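The baseline above was taken directly against a single disk on one node; an invocation along these lines would do that (the device path and the 4k random pattern are assumptions, and writing to a raw device is of course destructive):
fio --name=ssd-baseline --filename=/dev/sdX --ioengine=libaio --direct=1 \
    --rw=randread --bs=4k --iodepth=32 --numjobs=1 --runtime=180 --ramp_time=30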
From these results, we assume that with 36 HDD OSDs and 12 SSD OSDs we should have, in total, approximately:
SSD
Read IOPS = 12 x 50k = 600k
Write IOPS = 12 x 20k / 3 replication = 80k
HDD
Read IOPS = 36 x 1k = 36k
Write IOPS = 36 x 1k / 3 replication = 12k
So we attach 1 RBD image from the SSD pool to 1 VM as its root (/) OS disk, then run fio inside the VM with the exact same config, but we only get these results:
SSD
Read IOPS = 46k
Write IOPS = 14.4k
That is roughly the IOPS of a single SSD, not the whole-array calculation.
At first we assumed that running 2 simultaneous fio tests on 2 VMs, each with its own RBD image from the same pool, would give the same result per VM; in theory we could then run up to 12 VMs to reach the cumulative maximum of the 12 SSD OSDs. But when we ran it, the result was simply divided by 2, far from my assumption:
VM 1
SSD
Read IOPS = 23k
Write IOPS = 7k
VM 2
SSD
Read IOPS = 23k
Write IOPS = 7k
These results mean that the first fio test really was the maximum performance. So my system is delivering only 8-10% of what it should, while we need to reach at least 60-70% as the KPI.
For the second try, I changed the VM's OS disk to an RBD image from the HDD pool. Strangely enough, when we ran it we got exactly the same result as with the RBD image from the SSD pool:
HDD
Read IOPS = 46k
Write IOPS = 14.4k
I'm a little confused now. I would expect different results when using images from different pools, but that is not the case; it's as if both tests hit the same backend with the same performance. We're really sure that we have already separated the SSD and HDD pools and crush rules.
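For completeness, this is the kind of check we can run to confirm which pool an image really lives in and which OSDs its objects land on (the image and object names here are illustrative):
rbd -p rbd-os ls                       # list images in the SSD pool
rbd info rbd-os/vm-100-disk-1          # reports the pool and the block_name_prefix
ceph osd map rbd-os <object-name>      # shows the PG and acting OSDs for an object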
My questions are:
1. Why do I get the same test results even though I tested with 2 different RBD images from 2 different pools and rulesets (SSD and HDD)?
2. Which performance am I actually measuring? If it is really the SSD performance, why is it so poor? And why does the HDD pool test show the same numbers as the SSD pool?
3. Vice versa, if it is the HDD performance (which would at least be theoretically plausible for the HDD pool), why does the SSD pool test show the same HDD-level numbers?
4. Where did I go wrong? Is my understanding of the concept wrong, or is the concept correct and I just need to change something in my configuration?
If you need any other data about my system to help you analyze this, I will provide it ASAP. Thank you all :)