Hey,
Answers inline
On 4/22/22 15:28, Mark Nelson wrote:
Hi BR,
Are the numbers representing IOPS?
Yes, the numbers are IOPS; the comparison tables are in % (100 * Rn / R1),
i.e. each run relative to the first run.
Also if this is a custom build, did you
make sure to pass -DCMAKE_BUILD_TYPE=RelWithDebInfo?
That's also an interesting story... I think the 15.x and 16.x DEB builds
from eu.ceph.com are missing -O2 (and thus presumably missing
RelWithDebInfo?)
Comparison of 3 builds:
a) 15.2.14 from eu.ceph.com (comparison base)
b) 15.2.14 custom build (-march=znver2 -O2)
c) 15.2.14 from Ubuntu (focal 15.2.14-0ubuntu0.20.04.2)
https://brabiega.github.io/ceph/bench/15-2-14-compared.html
a) is significantly slower than b) and c);
b) and c) are more or less the same.
The custom build is from the Ubuntu source package with -march=znver2
-O2 set explicitly.
The build system is quite complicated; I'll dig in to make sure what the
state of '-DCMAKE_BUILD_TYPE=RelWithDebInfo' is.
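I'll probably start by checking what the packages were actually built
with, along these lines (just a sketch; paths depend on where the build
tree / source package lives):

  # in a local build tree: what cmake actually cached
  grep -iE 'CMAKE_BUILD_TYPE|CMAKE_CXX_FLAGS' build/CMakeCache.txt
  # in the debian source package: how debian/rules drives cmake
  grep -iE 'CMAKE_BUILD_TYPE|CXXFLAGS' debian/rules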
Also, what model
NVMe drive?
SAMSUNG MZQLB3T8HALS-00007
SAMSUNG MZQLB3T8HALS-00003
3.84TB
The first thought that came to mind was the allocator being
used, but it looks like 14.2.22 and 15.2.14 should both be using hybrid
like 16.2.7 so it's not likely that. What I find especially interesting
though is that we've recently done quite a bit of aging tests on
16.2.7 on a vaguely similar platform without seeing that kind of
behavior in pacific (though we did see a regression introduced during
quincy development that we just fixed last week).
For example:
https://docs.google.com/spreadsheets/d/18wexXme39GTlOlig1Mv6F27-Zky9hk7LWwIQTKjhgf8/edit#gid=0
That's not quite the same setup as your tests, but vaguely similar. I'll
try to lay out the HW/SW config similarly to how you did:
Hardware setup
--------------
10x backend servers
CPU: 1x AMD EPYC 7742 64-Core (64c+64t)
Storage: 6x NVMe (4TB Samsung PM983)
Network: 100Gbps
OS: CentOS Stream
Kernel: 4.18.0-358.el8.x86_64
Clients co-located on servers
Software config
---------------
60 OSDs in total (6 OSDs per host)
1 OSD per NVMe drive
Each OSD runs on bare metal
Scrub disabled
Deep-scrub disabled
Ceph balancer off
1 pool 'rbd': (3x replication)
- 1024 PG
- PG autoscaler off
8GB osd_memory_target per OSD
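Just to check that I read the software config right, it would roughly
map to commands like the following? (my guess, not necessarily what you
ran)

  ceph osd set noscrub
  ceph osd set nodeep-scrub
  ceph balancer off
  ceph osd pool create rbd 1024 1024 replicated    # 3x replication is the default size
  ceph osd pool set rbd pg_autoscale_mode off
  ceph config set osd osd_memory_target 8589934592   # 8GB per OSD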
Test environment
----------------
- 40 rbd images (default features, size 64GB)
- All the images are fully written before any tests are done! (4GB writes)
- client versions match server versions (Pacific 16.2.7 vs Quincy
17.2.0ish)
- Each client runs fio with rbd engine (librbd) against 4 rbd images
(10x4 in total)
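And the per-client fio invocation would be something like the below?
(again just my guess, one image shown; rw/bs obviously vary per test and
the image name is made up)

  fio --name=rbd-test --ioengine=rbd --clientname=admin --pool=rbd \
      --rbdname=img00 --rw=randwrite --bs=4k --iodepth=256 \
      --numjobs=1 --time_based --runtime=300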
Some immediate observations:
My numbers are much higher overall (look at 100% 4k random reads: 600K
IOPS vs 2M)
Yes, I noticed that earlier (I asked about Ceph performance in the
"Benching ceph for high speed RBD" thread). It looks weird and I haven't
figured out why the difference is so big.
Your degradation always seems to happen when writes are involved
Correct
You have more NVMe drives per host (24 vs 6)
I have more servers with much more aggregate CPU (10x64c vs 4x48c)
You have a much bigger prefill dataset (128x128GB vs 40x64GB)
You have more clients (128 vs 40)
You have lower io_depth (128x[4,32,64] vs 40x256)
I (may?) have higher osd_memory_target per OSD (8GB vs default 4GB)
I'm using the default 4GB
You are doing LXD containers, I'm doing bare metal
According to some internal tests this only costs a few % of performance;
let's say containers are <5% slower than bare metal. Although we could
probably use more tests here.
New immediate thoughts that come to mind:
With 3x replication, you have a dataset size of ~683GB per OSD. That's a
lot of onodes to do random IO across especially as they become
fragmented and take more space in memory. I'd expect that you are doing
a lot of onode reads from disk regardless of the version of Ceph being
used, and it makes sense that it would get worse over time. That doesn't
explain why your earlier results don't degrade at all and why your
16.2.7 results degrade to such a huge extent, though.
In my tests I have a dataset size of ~128GB per OSD, which I know from
previous testing has
an onode footprint that can fit pretty well into a default 4GB
osd_memory_target, but I'm also increasing the per-OSD target to 8GB. To
see such a huge drop in performance either means writes to the drive are
super slow or something else is slowing them down.
The degradation is reproducible across the whole 16.2.x series; I've even
tested 16.2.0 and the outcome is the same. There is no such degradation
on the 14.x/15.x series.
What is more (!):
- I ran a test after a long idle period (24h): a single test (25 fio
runs), and the write results are still significantly lower compared to
the 1st run
- I restarted the entire cluster: a single test (25 fio runs), and the
write results are still significantly lower compared to the 1st run
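For what it's worth, one thing I can capture on my side during the runs
is the onode cache behaviour, roughly like this (counter/pool names vary
a bit between releases, so just a sketch):

  ceph daemon osd.0 dump_mempools | grep -i -A2 onode
  ceph daemon osd.0 perf dump | grep -i onode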
One possibility is
throttling at the rocksdb WAL. It might be worth checking to see if you
are seeing any write throttling due to slow L0 compactions in rocksdb.
How do I check that?
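My naive guess would be to raise debug_rocksdb and look for stall
messages and compaction stats in the OSD log, plus the rocksdb perf
counters, e.g.:

  grep -iE 'stalling writes|stopping writes' /var/log/ceph/ceph-osd.*.log
  ceph daemon osd.0 perf dump | grep -i rocksdb

...but I'm not sure that's the right place to look.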
It might be worth looking at the overall state of the systems as these
tests run. Are you CPU limited? Are you network limited? Do you have
available memory that could be going to the OSDs?
For small IOs I'm CPU limited.
For big IOs I'm network limited (4GB/s on writes, 12GB/s on reads).
I have 256GB of RAM on each server; that's, let's say, 240GB / 24 OSDs =
10GB per OSD.
If nothing else explains it, you can always try my wallclock profiler
against one of the OSDs:
https://github.com/markhpc/uwpmp
run it like: './unwindpmp -p `pidof ceph-osd` -n 1000 -b libdw'
I'll check that as a last resort :)
I'll try to run a workload similar to yours (40 images, 256qd). Let's
see what I get.
The only difference that comes to my mind is that my tests were a bit
longer (15 mins vs 5 mins? per fio run)
I'd be happy to hear any other ideas
Thanks for your answers, Mark.
BR
BR