Hello,

I'm currently in the process of setting up a production Ceph cluster on a 40 Gbit network (40 Gbit for both the cluster/internal and the public network). I have already done a lot of machine/Linux tuning:

- cpupower C-states disabled
- lowlatency kernel
- kernel tweaks
- RX buffer tuning
- IRQ affinity mappings
- correct PCIe bus placement of the NICs
- MTU tuning
+ many more

My machines are connected through a switch capable of multiple Tbit/s. iperf between two machines gives:

- single connection: ~20 Gbit/s (hitting the single-core limit)
- multiple connections: ~39 Gbit/s

A perfect starting point, I would say. Let's conservatively assume 20 Gbit/s as the usable network speed.

Now I wanted to check how much overhead Ceph adds and what the absolute maximum is that I could get out of my cluster. I am using 8 servers (16 cores + HT each, all with identical CPUs/RAM/network cards), each with plenty of RAM, which I am using for these tests, and a simple pool with 3x replication.

To take the disks out of the equation and prove that the servers themselves are fast, I created 70 GB RAM drives on all of them using the kernel module "brd". Benchmarking the RAM drives with fio gave the following results:

- 5M IOPS random read @ 4k
- 4M IOPS random write @ 4k
- read and write latency below 10 us @ QD=16
- read and write latency below 50 us @ QD=256
- 1.5 GB/s sequential read @ 4k (QD=1)
- 1.0 GB/s sequential write @ 4k (QD=1)
- 15 GB/s read @ 4k (QD=256)
- 10 GB/s write @ 4k (QD=256)

Pretty impressive, disks we are all dreaming about, I would say.

To make sure I don't bottleneck anything, I created the following setup (rough command sketches are appended at the end of this mail):

- 3 servers, each running a mon, mgr and mds (all running entirely in RAM, including their directories under /var/lib/ceph, via tmpfs/brd ramdisk)
- 4 servers exposing their RAM drive as an OSD, created with BlueStore via ceph-volume raw or lvm
- 1 server using rbd-nbd to map one RBD image as a block device and benchmark it

In this scenario I would expect impressive results: the only bottleneck between the servers is the 20 Gbit/s network. Everything else runs completely in low-latency ECC memory and should be blazing fast until the network limit is reached.

The benchmark was monitored with this tool:
https://github.com/ceph/ceph/blob/master/src/tools/histogram_dump.py
and by looking at the raw data from "ceph daemon osd.7 perf dump".

*Result:*

Either something is going really wrong here, or Ceph has huge internal bottlenecks that should be solved...

histogram_dump frequently showed latency spikes in the "1M to 51M" buckets, which, if I am reading it correctly, is seconds?! How is that possible? This should always stay in the 0-99k range.

The perf dump numbers were also crazy slow: https://pastebin.com/ukV0LXWH
Especially these counters:

- op_latency
- op_w_latency
- op_w_prepare_latency
- state_deferred_cleanup_lat

All of them in the millisecond range?

fio against the nbd device reached 40k IOPS @ 4k with QD=256, at latencies of 1-7 milliseconds.

A quick back-of-the-envelope calculation: pushing 4 KiB through a 40 Gbit link takes about 4096 B * 8 bit/B / 40e9 bit/s ~= 0.8 us; let's call it 2 us with overhead, multiplied by 3 round trips because of the replication. That should still easily end up below 50 us of latency.

I have also read a lot about slow flushing disks, but there is definitely something going on in the software when it cannot even handle in-memory disks without taking milliseconds.

All servers were running Ubuntu 21.10 and "ceph version 16.2.6 (ee28fb57e47e9f88813e24bbf4c14496ca299d31) pacific (stable)".

What did I do wrong? What is going on here?! Please help me out!
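For reference, this is roughly how the RAM drives were created and baselined; the size, device name and fio parameters here are illustrative, not copied from my shell history:

# Create one 70 GB RAM block device via the brd kernel module
# (rd_size is given in KiB: 70 * 1024 * 1024 = 73400320)
modprobe brd rd_nr=1 rd_size=73400320 max_part=0
lsblk /dev/ram0

# 4k random-write baseline against the raw RAM drive, QD=16
fio --name=ramdrive-randwrite --filename=/dev/ram0 \
    --ioengine=libaio --direct=1 --rw=randwrite --bs=4k \
    --iodepth=16 --numjobs=1 --runtime=60 --time_based \
    --group_reporting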
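The OSDs and the benchmark image were set up roughly like this (pool/image names and sizes are placeholders; I used ceph-volume raw on some nodes and lvm on others, the lvm variant is shown here):

# On each of the 4 OSD servers: turn the RAM drive into a BlueStore OSD
ceph-volume lvm create --bluestore --data /dev/ram0

# On the benchmark server: 3x replicated pool plus a test image
ceph osd pool create rbdbench 128 128 replicated
ceph osd pool set rbdbench size 3
rbd pool init rbdbench
rbd create rbdbench/bench --size 50G

# Map the image via rbd-nbd and run the same 4k fio job against it
rbd-nbd map rbdbench/bench
fio --name=rbd-nbd-randwrite --filename=/dev/nbd0 \
    --ioengine=libaio --direct=1 --rw=randwrite --bs=4k \
    --iodepth=256 --numjobs=1 --runtime=60 --time_based \
    --group_reporting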
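And this is roughly how I watched the OSD-side latencies while fio was running; osd.7 is just one example daemon, and the jq paths assume the usual section names from the perf dump output linked above:

# Raw per-op latency histograms (what histogram_dump.py visualizes)
ceph daemon osd.7 perf histogram dump

# Selected latency counters from the regular perf counters
ceph daemon osd.7 perf dump | jq '.osd.op_latency, .osd.op_w_latency, .osd.op_w_prepare_latency'
ceph daemon osd.7 perf dump | jq '.bluestore.state_deferred_cleanup_lat'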
_______________________________________________ ceph-users mailing list -- ceph-users@xxxxxxx To unsubscribe send an email to ceph-users-leave@xxxxxxx