RBD throughput/IOPS benchmarks

Dear all,

We currently run a small Ceph cluster on 2 machines and we are wondering about the theoretical maximum bandwidth/IOPS we can achieve through RBD with our setup.

Here are the environment details:

- The Ceph release is Octopus 15.2.1 running on CentOS 8; both machines have 180 GB RAM, 72 cores, and 40 * 1.8 TB SSD disks each
- Regarding the network, we deployed two isolated 100 Gb/s networks for front and back connectivity
- Since all disks have the same performance, we created 1 OSD per SSD using BlueStore (default setup with LVM), for a total of 80 OSDs (40 OSDs per machine)
- On top of that we have a single 2x replicated RBD pool with 2048 PGs, giving a global average of about 50 PGs per OSD (our experiments with 100 PGs/OSD didn't provide any performance improvement, only extra CPU consumption; a quick sanity check of this figure follows the list)
- We kept the default settings for all RBD images we created for the benchmarks (4 MB object size, 4 MB stripe width, 1 stripe)
- The CRUSH map and replication rules are very simple (2 hosts, 40 OSDs per host with the same device class and weight)
- All tuning settings (cache sizing, op threads, BlueStore, RocksDB options, etc.) are the defaults provided with the Octopus release.
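For reference, here is the back-of-the-envelope check behind the "50 PGs per OSD" figure (just a small sketch; the pool size, PG count, and OSD count are the ones listed above):

    # Rough check of the average PG count per OSD for the pool described above.
    pg_num = 2048        # PGs in the 2x replicated RBD pool
    pool_size = 2        # replication factor
    osd_count = 80       # 40 OSDs per host, 2 hosts

    # Each PG has `pool_size` replicas, each placed on a different OSD,
    # so the average number of PG replicas hosted by one OSD is:
    pgs_per_osd = pg_num * pool_size / osd_count
    print(f"~{pgs_per_osd:.0f} PG replicas per OSD")  # -> ~51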

Here are the best values observed so far using both rados bench and fio with many different setups (varying numbers of clients, threads, RBD images, block sizes from 4k to 4m, random/sequential access, iodepth, etc.; a rough sketch of how we drive one such fio sweep is shown after the results below):

- Read BW: 24 GB/s (it looks like we reached the aggregate network capacity of both machines here)
- Read IOPS: 600k
- Write BW: 7 GB/s
- Write IOPS: 100k

Those are simply the maximum numbers obtained regardless of latency, as we first want to stress the infrastructure to see what maximum throughput and IOPS we can achieve. Latency measurements will come later.
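For context, this is roughly how we drive the fio sweeps (a minimal sketch only, assuming fio is built with the librbd ioengine; the "rbd_bench" pool, "img1" image, client name, and sweep ranges are placeholders, not our exact commands):

    import itertools
    import subprocess

    # Illustrative sweep over block size and queue depth against one RBD image.
    block_sizes = ["4k", "64k", "1m", "4m"]
    io_depths = [1, 16, 64]

    for bs, iodepth in itertools.product(block_sizes, io_depths):
        cmd = [
            "fio",
            "--name=rbd-randwrite",
            "--ioengine=rbd",          # requires fio built with librbd support
            "--clientname=admin",      # cephx user (placeholder)
            "--pool=rbd_bench",        # placeholder pool name
            "--rbdname=img1",          # placeholder image name
            "--rw=randwrite",
            f"--bs={bs}",
            f"--iodepth={iodepth}",
            "--numjobs=1",
            "--time_based", "--runtime=60",
            "--output-format=json",
        ]
        subprocess.run(cmd, check=True)

In practice we vary the read/write mix, the number of jobs, and the number of concurrent images/clients in the same way.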

We also have the feeling that the 2x replication of the RBD pool is a significant penalty with only 2 nodes in the cluster, dividing the maximum speeds by more than 2 (see the rough arithmetic below). This will probably have much less impact once we scale the cluster out with new nodes.
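To make that intuition concrete, here is the simple arithmetic we have in mind (a sketch under the assumption that replica writes are the dominant overhead; it ignores WAL/RocksDB write amplification):

    # With size=2 and only two hosts, every client write is stored on both
    # nodes, so the OSDs collectively absorb twice the client write bandwidth.
    client_write_bw_gbs = 7     # GB/s, best observed client-side write BW
    replication_factor = 2

    backend_write_bw_gbs = client_write_bw_gbs * replication_factor
    print(f"{client_write_bw_gbs} GB/s of client writes -> "
          f"~{backend_write_bw_gbs} GB/s written across the OSDs")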

We also noticed that at some points during recovery operations (e.g. rebalancing PGs after a new OSD was added to the pool) the total read/write throughput and IOPS climb to several GB/s and millions of IOPS, so we wonder whether we can do any better with legitimate RBD client load.

Would you like to share numbers from your own setups, or do you have any hints for potential improvements?

Thanks.

Regards,

--
Vincent Kherbache
R&D Director
Titan Datacenter
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx


