Re: Benching ceph for high speed RBD

Hi Bartosz,


As luck would have it, I've been running a large bisection sweep lately looking at regressions as well, so I can give you comparable numbers for our AMD test cluster. I'll try to format them similarly to what you've got below.  The biggest difference between our tests is likely that you are far more CPU, memory, and PCIe limited in your tests than I am in mine, so in some cases you may be seeing the same regressions and in others different ones, depending on what the test is bound by.  Also, I noticed you don't mention run times or cluster aging; both can also have an effect.


Hardware setup
--------------
10x backend servers
CPU: 1x AMD EPYC 7742 64-Core (64c+64t)
Storage: 6x NVMe (4TB Samsung PM983)
Network: 100gbps
OS: CentOS Stream
Kernel: 4.18.0-358.el8.x86

Server nodes also serving as clients


Software config
---------------
60 OSDs in total (6 OSDs per host)
1 OSD per NVMe drive
Each OSD runs on bare metal
4k bluestore_min_alloc_size_ssd (even for previous releases that defaulted to 16k)
rbd cache disabled
8GB osd_memory_target
Scrub disabled
Deep-scrub disabled
Ceph balancer off
1 pool 'rbd':
- 1024 PG
- PG autoscaler off
- 3x replication
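
For anyone trying to reproduce a setup like this, here's a minimal sketch (not our actual tooling) of how the settings above map onto ceph CLI calls, wrapped in Python only for readability.  It assumes a freshly deployed cluster with a working admin keyring on the node, and note that bluestore_min_alloc_size_ssd only takes effect for OSDs created after it is set:

#!/usr/bin/env python3
# Sketch only: applies the settings listed above via the ceph/rbd CLIs.
# Assumes a freshly deployed cluster with a working admin keyring.
import subprocess

def ceph(*args):
    subprocess.run(["ceph", *args], check=True)

# 4k allocation unit on SSD; only affects OSDs created after this is set
ceph("config", "set", "osd", "bluestore_min_alloc_size_ssd", "4096")
# 8GB osd_memory_target
ceph("config", "set", "osd", "osd_memory_target", str(8 * 1024**3))
# rbd cache disabled
ceph("config", "set", "client", "rbd_cache", "false")

# scrub, deep-scrub, and balancer off
ceph("osd", "set", "noscrub")
ceph("osd", "set", "nodeep-scrub")
ceph("balancer", "off")

# 1 pool 'rbd': 1024 PGs, autoscaler off, 3x replication
ceph("osd", "pool", "create", "rbd", "1024", "1024", "replicated")
ceph("osd", "pool", "set", "rbd", "pg_autoscale_mode", "off")
ceph("osd", "pool", "set", "rbd", "size", "3")
subprocess.run(["rbd", "pool", "init", "rbd"], check=True)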


Tests
----------------
qd - queue depth (number of IOs issued simultaneously to a single RBD image)

Test environment
----------------
- 40 rbd images (default features, size 256GB)
- All the images have 64GB written before tests are done (64GB dataset per image; a prefill sketch follows this list).
- client version same as osd version
- Each node runs fio with rbd engine (librbd) against 4 rbd images (10x4 in total)
- Ceph is compiled and installed from src.
- In some cases entire tests were repeated (labeled b, c, d, etc. after the version)
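
For reference, the image creation and prefill can be scripted with the python-rbd bindings roughly as below.  This is only a sketch: the image naming scheme is made up, error handling is omitted, and in practice a sequential fio or 'rbd bench' pass is a much faster way to prefill:

#!/usr/bin/env python3
# Sketch only: creates 40 test images (256GB, default features) and writes
# a 64GB dataset into each.  Image names are made up; error handling omitted.
import rados
import rbd

GB = 1024 ** 3
CHUNK = b"\xaa" * (4 * 1024 * 1024)   # prefill in 4MB writes

cluster = rados.Rados(conffile="/etc/ceph/ceph.conf")
cluster.connect()
ioctx = cluster.open_ioctx("rbd")
for i in range(40):
    name = "testimg-%02d" % i
    rbd.RBD().create(ioctx, name, 256 * GB)       # default features
    image = rbd.Image(ioctx, name)
    for off in range(0, 64 * GB, len(CHUNK)):     # 64GB sequential prefill
        image.write(CHUNK, off)
    image.close()
ioctx.close()
cluster.shutdown()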

IOPS tests
==========

- 4MB, 128KB, and 4KB IO Sizes.
- read(seq), write(seq), randread, randwrite test types
- Each sweep runs every combination of IO size and test type (roughly as in the sketch after this list)
- Cluster runs 3 sweeps
- Cluster is rebuilt for every ceph release
- 300s runtime per test
- 256qd
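
Roughly, one sweep from a single client node could be scripted as below: it writes a fio job file using the rbd ioengine (librbd) with one job section per image and loops over the IO size / test type matrix.  The image names and the exact job layout are assumptions for illustration, not the actual harness:

#!/usr/bin/env python3
# Sketch only: runs one sweep (every IO size x test type combination) from one
# client node against its 4 images, 300s per test at iodepth 256.
# Image names follow the made-up scheme above; result parsing is left out.
import subprocess

IO_SIZES = ["4M", "128k", "4k"]
TEST_TYPES = ["read", "write", "randread", "randwrite"]
IMAGES = ["testimg-%02d" % i for i in range(4)]   # this node's 4 images

def run_one(bs, rw):
    job = ["[global]",
           "ioengine=rbd",            # librbd engine
           "clientname=admin",
           "pool=rbd",
           "bs=" + bs,
           "rw=" + rw,
           "iodepth=256",
           "runtime=300",
           "time_based=1",
           "group_reporting=1"]
    for img in IMAGES:
        job += ["[" + img + "]", "rbdname=" + img]
    with open("sweep.fio", "w") as f:
        f.write("\n".join(job) + "\n")
    subprocess.run(["fio", "sweep.fio"], check=True)

for bs in IO_SIZES:
    for rw in TEST_TYPES:
        run_one(bs, rw)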


4k randwrite	Sweep 0 IOPS	Sweep 1 IOPS	Sweep 2 IOPS	Notes
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
14.0.0		496381		491794		451793		Worst
14.2.0		638907		620820		596170		Big improvement
14.2.10		624802		565738		561014
14.2.16		628949		564055		523766
14.2.17		616004		550045		507945
14.2.22		711234		654464		614117		Huge start, but degrades

15.0.0		636659		620931		583773		
15.2.15		580792		574461		569541		No longer degrades
15.2.15b	584496		577238		572176		Same

16.0.0		551112		550874		549273		Worse than octopus? (Doesn't match prior Intel tests)
16.2.0		518326		515684		523475		Regression, doesn't degrade
16.2.4		516891		519046		525918
16.2.6		585061		595001		595702		Big win, doesn't degrade
16.2.7		597822		605107		603958		Same
16.2.7b		586469		600668		599763		Same


FWIW, we've also been running single OSD performance bisections:
https://gist.github.com/markhpc/fda29821d4fd079707ec366322662819

I believe at least one of the regressions may be related to
https://github.com/ceph/ceph/pull/29674

There are other things going on in other tests (large sequential writes!) that are still being diagnosed.

Mark


On 2/23/22 05:10, Bartosz Rabiega wrote:
Hello cephers,

I've recently been doing some intensive performance benchmarks of different ceph versions.
I'm trying to figure out perf numbers which can be achieved for high speed ceph setup for RBD (3x replica).
From what I can see there is a significant perf drop in the 16.2.x series (4k writes).
I can't find any clear reason for such behavior.

Hardware setup
--------------
3x backend servers
CPU: 2x AMD EPYC 7402 24-Core (48c+48t)
Storage: 24x NVMe
Network: 40gbps
OS: Ubuntu Focal
Kernel: 5.15.0-18-generic

4x client servers
CPU: 2x AMD EPYC 7402 24-Core (48c+48t)
Network: 40gbps
OS: Ubuntu Focal
Kernel: 5.11.0-37-generic

Software config
---------------
72 OSDs in total (24 OSDs per host)
1 OSD per NVMe drive
Each OSD runs in LXD container
Scrub disabled
Deep-scrub disabled
Ceph balancer off
1 pool 'rbd':
- 1024 PG
- PG autoscaler off

Test environment
----------------
- 128 rbd images (default features, size 128GB)
- All the images are fully written before any tests are done! (4194909 objects allocated)
- client version ceph 16.2.7 vanilla eu.ceph.com
- Each client runs fio with rbd engine (librbd) against 32 rbd images (4x32 in total)


Tests
----------------
qd - queue depth (number of IOs issued simultaneously to a single RBD image)

IOPS tests
==========
- random IO 4k, 4qd
- random IO 4k, 64qd

Write			4k 4qd	4k 64qd
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
14.2.16			69630	132093
14.2.22			97491	156288
15.2.14			77586	93003
*15.2.14 – canonical	110424	168943
16.2.0			70526	85827
16.2.2			69897	85231
16.2.4			64713	84046
16.2.5			62099	85053
16.2.6			68394	83070
16.2.7			66974	78601

		
Read			4k 4qd	4k 64qd
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
14.2.16			692848	816109
14.2.22			693027	830485
15.2.14			676784	702233
*15.2.14 – canonical	749404	792385
16.2.0			610798	636195
16.2.2			606924	637611
16.2.4			611093	630590
16.2.5			603162	632599
16.2.6			603013	627246
16.2.7			-	-

* Very oddly, the best performance was achieved with the Canonical build of Ceph 15.2.14 (15.2.14-0ubuntu0.20.04.2).
14.2.22 performs very well.
15.2.14 from Canonical is the best in terms of writes.
Writes in the 16.2.x series are quite poor compared to other versions.

BW tests
========
- sequential IO 64k, 64qd

These results are mostly the same for all ceph versions.
Writes ~4.2 GB/s
Reads ~12 GB/s

These results seem to be limited by network bandwidth: with 3x replication, ~4.2 GB/s of client writes translates to roughly 12.6 GB/s of inbound traffic across the three 40 Gb/s (~5 GB/s) backend NICs, close to their combined limit.


Questions
---------
Is there any reason for the performance drop in 16.x series?
I'm looking for some help/recommendations to get as many IOPS as possible (especially for writes, as reads are good enough).

We've been trying to find out what makes the difference in the Canonical builds. A few leads indicate that
extraopts += -DCMAKE_BUILD_TYPE=RelWithDebInfo was not set for builds from the Ceph foundation:
https://github.com/ceph/ceph/blob/master/do_cmake.sh#L86
How can this be checked? Would someone be able to take a look there?
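
One rough way to check an installed binary, sketched below, is to read the DW_AT_producer attribute from its DWARF info: recent GCC records the compile flags there by default (-grecord-gcc-switches), so a RelWithDebInfo build should show -O2 -g, while a build without an optimized CMAKE_BUILD_TYPE typically shows no -O flag. This only works if debug info is present (for distro packages that usually means installing the matching -dbg/dbgsym or debuginfo package), so treat it as a sketch, not a definitive answer:

#!/usr/bin/env python3
# Sketch only: prints the DW_AT_producer string (compiler + recorded flags) from a
# binary's DWARF info.  Needs binutils (readelf) and debug info for the binary.
import subprocess
import sys

binary = sys.argv[1] if len(sys.argv) > 1 else "/usr/bin/ceph-osd"
with subprocess.Popen(["readelf", "--debug-dump=info", binary],
                      stdout=subprocess.PIPE, text=True) as proc:
    for line in proc.stdout:
        if "DW_AT_producer" in line:
            # e.g. "... GNU C++14 9.4.0 ... -O2 -g ..." for a RelWithDebInfo build
            print(line.strip())
            break
    proc.kill()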

BR
Bartosz Rabiega





