Re: Ceph 16.x RBD significant performance degradation over time

On 4/22/22 06:08, bartosz.rabiega@xxxxxxxxxxxx wrote:

Hello,

I've been recently doing some extensive tests of Ceph powering block storage (RBD).

What I did is basically:
- create cluster with specific version
- apply the same config
- fully prewrite 128 RBD images
- run the test suite 3x (25 fio runs with different block sizes and iodepths)

It turns out that RBD performance degrades significantly after one test suite on Ceph 16.x. There is no such behavior on older versions.
Each fio run lasts 15 minutes, so the entire suite is 25 fio runs * 15 minutes.
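For reference, one cell of such a suite might look roughly like the fio invocation below. This is only a sketch; the pool name, image names and the exact block size / iodepth matrix are assumptions, not the actual job files used.

  # Sweep a few block sizes and queue depths against one RBD image (names assumed).
  for bs in 4k 64k 1m; do
    for qd in 4 32 64; do
      fio --name=rbd-test --ioengine=rbd --pool=rbd --rbdname=img-000 \
          --clientname=admin --rw=randwrite --bs=$bs --iodepth=$qd \
          --direct=1 --time_based --runtime=900 \
          --output-format=json --output=rbd_${bs}_qd${qd}.json
    done
  done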

Has anyone observed such a thing?

Comparison results are posted on GitHub Pages (see the relative comparison tables):
https://brabiega.github.io/ceph/bench/14-2-16-eucephcom.html
https://brabiega.github.io/ceph/bench/14-2-22-eucephcom.html
https://brabiega.github.io/ceph/bench/15-2-16-eucephcom.html
https://brabiega.github.io/ceph/bench/16-2-5-eucephcom.html
https://brabiega.github.io/ceph/bench/16-2-7-eucephcom.html
https://brabiega.github.io/ceph/bench/16-2-7-znver2o2-minconf.html (minimal ceph.conf, only the basics needed for a working cluster)

All the results are here, including another interesting finding (see the 15.2.14 comparison):
https://brabiega.github.io/ceph/ceph.html


Here's my setup:

Hardware setup
--------------
3x backend servers
CPU: 2x AMD EPYC 7402 24-Core (48c+48t)
Storage: 24x NVMe
Network: 40gbps
OS: Ubuntu Focal
Kernel: 5.15.0-18-generic

4x client servers
CPU: 2x AMD EPYC 7402 24-Core (48c+48t)
Network: 40gbps
OS: Ubuntu Focal
Kernel: 5.11.0-37-generic

Software config
---------------
72 OSDs in total (24 OSDs per host)
1 OSD per NVMe drive
Each OSD runs in LXD container
Scrub disabled
Deep-scrub disabled
Ceph balancer off
1 pool 'rbd':
- 1024 PG
- PG autoscaler off
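The cluster-level settings above roughly correspond to commands like the following (a hedged sketch; the pool name and PG count are taken from the list above, everything else is assumed rather than the exact commands used):

  # Disable scrubbing and the balancer, create the test pool with autoscaling off.
  ceph osd set noscrub
  ceph osd set nodeep-scrub
  ceph balancer off
  ceph osd pool create rbd 1024 1024 replicated
  ceph osd pool set rbd pg_autoscale_mode off
  rbd pool init rbd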

Test environment
----------------
- 128 rbd images (default features, size 128GB)
- All the images are fully written before any tests are done! (4194909 objects allocated)
- client version ceph 16.2.7 vanilla eu.ceph.com
- Each client runs fio with rbd engine (librbd) against 32 rbd images (4x32 in total)
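The image creation and prefill step can be scripted along these lines (a rough sketch under the sizes above; the image naming and the sequential-write prefill parameters are assumptions):

  # Create 128 images and fully prewrite each one sequentially (names and bs assumed).
  for i in $(seq -w 0 127); do
    rbd create --size 128G rbd/img-$i
    fio --name=prefill-$i --ioengine=rbd --pool=rbd --rbdname=img-$i \
        --clientname=admin --rw=write --bs=4M --iodepth=32 --direct=1
  done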

BR


Hi BR,


Are the numbers representing IOPS? Also, if this is a custom build, did you make sure to pass -DCMAKE_BUILD_TYPE=RelWithDebInfo? And what model NVMe drive? The first thought that came to mind was the allocator being used, but it looks like 14.2.22 and 15.2.14 should both be using hybrid like 16.2.7, so it's probably not that. What I find especially interesting is that we've recently done quite a bit of aging testing on 16.2.7 on a vaguely similar platform without seeing that kind of behavior in Pacific (though we did see a regression introduced during Quincy development that we just fixed last week).


For example:

https://docs.google.com/spreadsheets/d/18wexXme39GTlOlig1Mv6F27-Zky9hk7LWwIQTKjhgf8/edit#gid=0


That's not quite the same setup as your tests, but vaguely similar. I'll try to lay out the HW/SW config similarly to how you did:

Hardware setup
--------------
10x backend servers
CPU: 1x AMD EPYC 7742 64-Core (64c+64t)
Storage: 6x NVMe (4TB Samsung PM983)
Network: 100gbps
OS: CentOS Stream
Kernel: 4.18.0-358.el8.x86_64

 Clients co-located on servers

Software config
---------------
 60 OSDs in total (6 OSDs per host)
1 OSD per NVMe drive
Each OSD runs on bare metal
Scrub disabled
Deep-scrub disabled
Ceph balancer off
1 pool 'rbd': (3x replication)
- 1024 PG
- PG autoscaler off
8GB osd_memory_target per OSD
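The memory target bump is the only non-default OSD option there; something like the following would set it (a sketch, value given in bytes for 8 GiB):

  # Raise the per-OSD memory target to 8 GiB (default is 4 GiB).
  ceph config set osd osd_memory_target 8589934592
  ceph config get osd.0 osd_memory_target   # verify it took effect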

Test environment
----------------
- 40 rbd images (default features, size 64GB)
- All the images are fully written before any tests are done! (4GB writes)
- client versions match server versions (Pacific 16.2.7 vs Quincy 17.2.0ish)
- Each client runs fio with rbd engine (librbd) against 4 rbd images (10x4 in total)


Some immediate observations:

My numbers are much higher overall (look at 100% 4k random reads: 600K IOPS vs 2M)

Your degradation always seems to happen when writes are involved

You have more NVMe drives per host (24 vs 6)

I have more servers with much more aggregate CPU (10x64c vs 4x48c)

You have a much bigger prefill dataset (128x128GB vs 40x64GB)

You have more clients (128 vs 40)

You have lower io_depth (128x[4,32,64] vs 40x256)

I (may?) have higher osd_memory_target per OSD (8GB vs default 4GB)

You are doing LXD containers, I'm doing bare metal


New immediate thoughts that come to mind:

With 3x replication, you have a dataset size of ~683GB per OSD. That's a lot of onodes to do random IO across, especially as they become fragmented and take more space in memory. I'd expect that you are doing a lot of onode reads from disk regardless of the version of Ceph being used, and it makes sense that it would get worse over time. That doesn't explain why your earlier results don't degrade at all and why your 16.2.7 results degrade to such a huge extent, though.

In my tests I have a dataset size of ~128GB per OSD, which I know from previous testing has an onode footprint that can fit pretty well into a default 4GB osd_memory_target, but I'm also increasing the per-OSD target to 8GB. To see such a huge drop in performance either means writes to the drive are super slow or something else is slowing them down. One possibility is throttling at the rocksdb WAL. It might be worth checking to see if you are seeing any write throttling due to slow L0 compactions in rocksdb.
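A couple of quick checks along those lines (a hedged sketch; exact counter names vary a bit between releases, and the log path assumes the default location on the OSD host):

  # Onode cache behaviour: lots of misses means metadata no longer fits in memory.
  ceph daemon osd.0 perf dump | grep -i onode

  # RocksDB write throttling: look for stall/stop messages from slow compactions
  # in the OSD log (raise debug_rocksdb if nothing shows up at the default level).
  grep -iE 'stalling|stopping writes' /var/log/ceph/ceph-osd.0.log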

It might be worth looking at the overall state of the systems as these tests run.  Are you CPU limited?  Are you network limited? Do you have available memory that could be going to the OSDs?
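Even coarse numbers from the standard tools while the benchmark is running would help narrow that down, e.g. (assuming sysstat is installed):

  sar -u 5 3              # CPU utilization
  sar -n DEV 5 3          # per-NIC throughput
  free -h                 # memory headroom that could go to osd_memory_target
  ceph daemon osd.0 dump_mempools | head -50   # per-pool OSD memory usage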


If nothing else explains it, you can always try my wallclock profiler against one of the OSDs:

https://github.com/markhpc/uwpmp

Run it like: ./unwindpmp -p `pidof ceph-osd` -n 1000 -b libdw



_______________________________________________
Dev mailing list -- dev@xxxxxxx
To unsubscribe send an email to dev-leave@xxxxxxx





