Hey Mark,
I forgot to explain what the numbers in my tests actually are.
They are the average IOPS of the entire cluster, computed with this formula:
Result = SUM(avg iops rbd1, avg iops rbd2, ..., avg iops rbd128)
where "avg iops rbd1" is the average IOPS reported by fio for that image (fio runs 1 job per image).
Is that also the case for your results?
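To make the formula concrete, here is a minimal Python sketch of the aggregation. It assumes every fio job was run with --output-format=json and that the per-image results have already been parsed. The helper name cluster_iops and the sample numbers are hypothetical; this is not the script or the data used in these tests.

```python
def cluster_iops(fio_results, direction="write"):
    """Sum the average IOPS that fio reports for each RBD image.

    fio_results: parsed fio JSON documents, one per image (1 job per image).
    direction:   "read" or "write".
    """
    total = 0.0
    for result in fio_results:
        for job in result["jobs"]:
            total += job[direction]["iops"]
    return total


# Made-up sample shaped like fio's JSON output, for two images only.
sample = [
    {"jobs": [{"jobname": "rbd1", "write": {"iops": 520.5}}]},
    {"jobs": [{"jobname": "rbd2", "write": {"iops": 498.0}}]},
]
print(cluster_iops(sample))
```

With 128 images the list would simply hold 128 parsed results, one per fio job.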
As for cluster aging and run times:
- each test run took 10 minutes
- the cluster was a long-lived one, upgraded multiple times (starting from 14.2.16, then 14.2.22, 15.2.14 and so on)
- I tried with a fresh cluster as well; 15.2.14 from Canonical always leads.
I also ran longer tests (1h and 2h 4k writes, on 15.2.14 and the 16.2.x series).
The perf looks stable and is more or less in line with the results I've presented.
These are the fio settings I find relevant:
--cpus_allowed_policy split
--randrepeat 0
--disk_util 0
--time_based
--ramp_time 10
--numjobs 1
--buffered 0
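For reference, a sketch of how these settings could be combined into one fio invocation per image. The pool name 'rbd', the 10-minute runtime and the 4k block size come from this thread; the helper name build_fio_cmd, the image name rbd1 and the client name admin are assumptions for illustration, not the exact command line used in the tests.

```python
def build_fio_cmd(pool, image, rw, iodepth, runtime_s, client="admin"):
    """Build a fio argument list for one RBD image (one job per image)."""
    return [
        "fio",
        f"--name={image}",
        "--ioengine=rbd",            # librbd engine
        f"--clientname={client}",    # assumed cephx user
        f"--pool={pool}",
        f"--rbdname={image}",
        f"--rw={rw}",
        "--bs=4k",
        f"--iodepth={iodepth}",
        f"--runtime={runtime_s}",
        # Settings listed above:
        "--cpus_allowed_policy=split",
        "--randrepeat=0",
        "--disk_util=0",
        "--time_based",
        "--ramp_time=10",
        "--numjobs=1",
        "--buffered=0",
        "--output-format=json",
    ]


# 4k random writes at queue depth 64 for 10 minutes against image "rbd1".
cmd = build_fio_cmd("rbd", "rbd1", "randwrite", 64, 600)
print(" ".join(cmd))
```

The list can be passed to subprocess.run as-is, once per image.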
I also did some tests with custom builds of 16.2.7:
- -march=znver2
- -O3
- gcc and clang
- link-time optimization
Read performance improved, but write performance was quite unstable (observed on all my custom builds).
I'll try to share more results and some screenshots with time series soon.
If there's something you'd like me to test on my setup then feel free to let me know.
BR
From: Mark Nelson <mnelson@xxxxxxxxxx>
Sent: Wednesday, February 23, 2022 3:26:32 PM
To: Bartosz Rabiega; dev; ceph-devel
Subject: Re: Benching ceph for high speed RBD
Hi Bartosz,
As luck would have it, I've been running through a huge bisection sweep
lately looking at regressions as well so can give you similar numbers
for our AMD test cluster. I'll try to format similarly to what you've
got below. The biggest difference between our tests is likely that you
are far more CPU, Memory, and PCIe limited in your tests than I am in
mine. You might be in some cases showing similar regression or
different ones depending on how the test is limited. Also, I noticed
you don't mention run times or cluster aging. That can also have an effect.
Hardware setup
--------------
10x backend servers
CPU: 1x AMD EPYC 7742 64-Core (64c+64t)
Storage: 6x NVMe (4TB Samsung PM983)
Network: 100gbps
OS: CentOS Stream
Kernel: 4.18.0-358.el8.x86
Server nodes also serving as clients
Software config
---------------
60 OSDs in total (6 OSDs per host)
1 OSD per NVMe drive
Each OSD runs on bare metal
4k min_alloc_size_ssd (even for previous releases that used 16k)
rbd cache disabled
8GB osd_memory_target
Scrub disabled
Deep-scrub disabled
Ceph balancer off
1 pool 'rbd':
- 1024 PG
- PG autoscaler off
- 3x replication
Tests
----------------
qd - queue depth (number of IOs issued simultaneously to single RBD image)
Test environment
----------------
- 40 rbd images (default features, size 256GB)
- All the images have 64GB written before tests are done (64GB dataset per image).
- client version same as osd version
- Each node runs fio with rbd engine (librbd) against 4 rbd images (10x4 in total)
- Ceph is compiled and installed from src.
- In some cases entire tests were repeated (labeled b, c, d, etc. after the version)
IOPS tests
==========
- 4MB, 128KB, and 4KB IO Sizes.
- read(seq), write(seq), randread, randwrite test types
- Each combination of io size and test type is run in 1 Sweep
- Cluster runs 3 sweeps
- Cluster is rebuilt for every ceph release
- 300s runtime per test
- 256qd
4k randwrite  Sweep 0 IOPS  Sweep 1 IOPS  Sweep 2 IOPS  Notes
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
14.0.0        496381        491794        451793        Worst
14.2.0        638907        620820        596170        Big improvement
14.2.10       624802        565738        561014
14.2.16       628949        564055        523766
14.2.17       616004        550045        507945
14.2.22       711234        654464        614117        Huge start, but degrades
15.0.0        636659        620931        583773
15.2.15       580792        574461        569541        No longer degrades
15.2.15b      584496        577238        572176        Same
16.0.0        551112        550874        549273        Worse than octopus? (Doesn't match prior Intel tests)
16.2.0        518326        515684        523475        Regression, doesn't degrade
16.2.4        516891        519046        525918
16.2.6        585061        595001        595702        Big win, doesn't degrade
16.2.7        597822        605107        603958        Same
16.2.7b       586469        600668        599763        Same
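One way to put a number on the "degrades" / "doesn't degrade" notes is the relative change from sweep 0 to sweep 2. A small Python sketch, using a few rows copied from the table above (the helper name sweep_change is just illustrative):

```python
# (sweep 0, sweep 1, sweep 2) 4k randwrite IOPS, copied from the table above.
sweeps = {
    "14.2.22": (711234, 654464, 614117),
    "15.2.15": (580792, 574461, 569541),
    "16.2.0": (518326, 515684, 523475),
    "16.2.7": (597822, 605107, 603958),
}


def sweep_change(iops):
    """Relative change from the first to the last sweep (negative = degradation)."""
    return (iops[-1] - iops[0]) / iops[0]


for version, iops in sweeps.items():
    print(f"{version:8s} {sweep_change(iops):+.1%}")
```

A clearly negative value means the release loses throughput as the cluster fills across sweeps; a value near zero or positive means it holds steady.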
FWIW, we've also been running single OSD performance bisections:
https://gist.github.com/markhpc/fda29821d4fd079707ec366322662819
I believe at least one of the regressions may be related to
https://github.com/ceph/ceph/pull/29674
There are other things going on in other tests (large sequential writes!) that are still being diagnosed.
Mark
On 2/23/22 05:10, Bartosz Rabiega wrote:
> Hello cephers,
>
> I've recently been doing some intensive performance benchmarks of different ceph versions.
> I'm trying to figure out perf numbers which can be achieved for high speed ceph setup for RBD (3x replica).
> From what I can see there is a significant perf drop on 16.2.x series (4k writes).
> I can't find any clear reason for such behavior.
>
> Hardware setup
> --------------
> 3x backend servers
> CPU: 2x AMD EPYC 7402 24-Core (48c+48t)
> Storage: 24x NVMe
> Network: 40gbps
> OS: Ubuntu Focal
> Kernel: 5.15.0-18-generic
>
> 4x client servers
> CPU: 2x AMD EPYC 7402 24-Core (48c+48t)
> Network: 40gbps
> OS: Ubuntu Focal
> Kernel: 5.11.0-37-generic
>
> Software config
> ---------------
> 72 OSDs in total (24 OSDs per host)
> 1 OSD per NVMe drive
> Each OSD runs in LXD container
> Scrub disabled
> Deep-scrub disabled
> Ceph balancer off
> 1 pool 'rbd':
> - 1024 PG
> - PG autoscaler off
>
> Test environment
> ----------------
> - 128 rbd images (default features, size 128GB)
> - All the images are fully written before any tests are done! (4194909 objects allocated)
> - client version ceph 16.2.7 vanilla eu.ceph.com
> - Each client runs fio with rbd engine (librbd) against 32 rbd images (4x32 in total)
>
>
> Tests
> ----------------
> qd - queue depth (number of IOs issued simultaneously to single RBD image)
>
> IOPS tests
> ==========
> - random IO 4k, 4qd
> - random IO 4k, 64qd
>
> Write                  4k 4qd    4k 64qd
> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> 14.2.16                69630     132093
> 14.2.22                97491     156288
> 15.2.14                77586     93003
> *15.2.14 – canonical   110424    168943
> 16.2.0                 70526     85827
> 16.2.2                 69897     85231
> 16.2.4                 64713     84046
> 16.2.5                 62099     85053
> 16.2.6                 68394     83070
> 16.2.7                 66974     78601
>
>
> Read                   4k 4qd    4k 64qd
> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> 14.2.16                692848    816109
> 14.2.22                693027    830485
> 15.2.14                676784    702233
> *15.2.14 – canonical   749404    792385
> 16.2.0                 610798    636195
> 16.2.2                 606924    637611
> 16.2.4                 611093    630590
> 16.2.5                 603162    632599
> 16.2.6                 603013    627246
> 16.2.7                 -         -
>
> * Very oddly, the best perf was achieved with the Ceph 15.2.14 build from Canonical (15.2.14-0ubuntu0.20.04.2).
> 14.2.22 performs very well.
> 15.2.14 from Canonical is the best in terms of writes.
> 16.2.x series writes are quite poor compared to other versions.
>
> BW tests
> ========
> - sequential IO 64k, 64qd
>
> These results are mostly the same for all ceph versions.
> Writes ~4.2 GB/s
> Reads ~12 GB/s
>
> It seems that the results here are limited by network bandwidth.
>
>
> Questions
> ---------
> Is there any reason for the performance drop in 16.x series?
> I'm looking for some help/recommendations to get as much IOPS as possible (especially for writes, as reads are good enough)
>
> We've been trying to find out what makes the difference in Canonical builds. A few leads indicate that
> extraopts += -DCMAKE_BUILD_TYPE=RelWithDebInfo was not set for builds from the Ceph Foundation.
> https://github.com/ceph/ceph/blob/master/do_cmake.sh#L86
> How can we check this? Would someone be able to take a look there?
>
> BR
> Bartosz Rabiega
>