Hi, yes, it has software bottlenecks :-) https://yourcmc.ru/wiki/Ceph_performance

If you just need block storage, try Vitastor https://vitastor.io/
https://yourcmc.ru/git/vitalif/vitastor/src/branch/master/README.md - I made
it architecturally very similar to Ceph. Or, if you're fine with an even
simpler DRBD-based design (no distribution of every image across the whole
cluster), there is also Linstor.

Ceph is designed for multiple use cases at once; its design tradeoffs are
fine for S3 (cold/warm) object storage, but high-performance SSD-based block
storage is a different thing :-)

> Hello,
>
> I'm currently in the process of setting up a production Ceph cluster on a
> 40 Gbit network (40 Gbit for both the cluster and public networks).
>
> I have already done a lot of machine/Linux tweaking:
>
> - cpupower: C-states disabled
> - low-latency kernel
> - kernel tweaks
> - RX buffer optimization
> - affinity mappings
> - correct PCIe bus mapping of the cards
> - MTU...
> + many more
>
> My machines are connected over a switch capable of multiple Tbit/s.
>
> iperf result between two machines:
> single connection: ~20 Gbit/s (reaching single-core limits)
> multiple connections: ~39 Gbit/s
>
> A perfect starting point, I would say. Let's assume we only reach
> 20 Gbit/s as the network speed.
>
> Now I wanted to check how much overhead Ceph has and what's the absolute
> maximum I could get out of my cluster.
>
> I'm using 8 servers (16 cores + HT each), all with the same CPUs/RAM/
> network cards, and plenty of RAM for my tests. A simple pool with 3x
> replication.
> To rule out the disks and prove that my servers are fast enough, I
> created 70 GB RAM drives on all of them.
>
> The RAM drives were created using the kernel module "brd".
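A quick back-of-the-envelope on what those iperf numbers allow for a single
replicated 4 KiB write - just a sketch; the 5 us per-hop round trip is my
assumption for a tuned 40GbE setup, not something from your mail:

```python
# Rough network-only floor for one replicated 4 KiB write.
# Assumptions (mine, not measured): 5 us per-hop round trip, serialization
# at the measured single-stream 20 Gbit/s, and the primary OSD fanning out
# to its 2 replicas in parallel (so 2 sequential network stages).

BLOCK_BITS = 4 * 1024 * 8   # 4 KiB payload in bits
LINK_BPS = 20e9             # measured single-connection iperf speed
RTT_US = 5.0                # assumed per-hop round trip, microseconds

wire_us = BLOCK_BITS / LINK_BPS * 1e6   # serialization delay per hop
# client -> primary, then primary -> replicas in parallel:
total_us = 2 * (wire_us + RTT_US)

print(f"wire time per hop:   {wire_us:.2f} us")
print(f"network-only floor: {total_us:.2f} us")
```

So even with generous per-hop overhead, the network alone explains well
under 50 us - nowhere near milliseconds.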
> Benchmarking the RAM drives with fio gave the following results:
>
> 5M IOPS read @ 4k
> 4M IOPS write @ 4k
>
> read and write | latency below 10 us @ QD=16
> read and write | latency below 50 us @ QD=256
>
> 1.5 GB/s sequential read @ 4k (QD=1)
> 1.0 GB/s sequential write @ 4k (QD=1)
>
> 15 GB/s read @ 4k (QD=256)
> 10 GB/s write @ 4k (QD=256)
>
> Pretty impressive - disks we are all dreaming about, I would say.
>
> To make sure I don't bottleneck anything, I created the following setup:
>
> - 3 servers, each running a mon, mgr and mds (all running in RAM,
>   including their dirs in /var/lib/ceph, via tmpfs/brd ramdisk)
> - 4 servers mapping their RAM drives as OSDs, created with BlueStore
>   using ceph-volume raw or lvm
> - 1 server using rbd-nbd to map one RBD image as a drive and benchmark it
>
> In this scenario I would expect impressive results; the only bottleneck
> between the servers is the 20 Gbit network speed. Everything else runs
> completely in low-latency ECC memory and should be blazing fast until the
> network speed is reached.
>
> The benchmark was monitored with this tool:
> https://github.com/ceph/ceph/blob/master/src/tools/histogram_dump.py
> and by looking at the raw output of "ceph daemon osd.7 perf dump".
>
> *Result:*
> Either something is going really wrong, or Ceph has huge bottlenecks
> inside the software which should be solved...
>
> histogram_dump often showed latency spikes between "1M and 51M" - which,
> if I read it correctly, is seconds?!
> How is that possible? This should always be between 0-99k.
>
> The perf dump results were also crazy slow:
> https://pastebin.com/ukV0LXWH
>
> Especially these KPIs:
> - op_latency
> - op_w_latency
> - op_w_prepare_latency
> - state_deferred_cleanup_lat
>
> All in the range of milliseconds?
>
> fio against the nbd drive reached 40k IOPS @ 4k with QD=256, with
> latency in the 1-7 millisecond range.
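To put that last fio number in perspective - plain arithmetic on the figures
quoted above, nothing assumed beyond them:

```python
# The measured 40k IOPS at 4 KiB, expressed as payload bandwidth, versus
# the single-stream network bandwidth from the iperf test above.

iops = 40_000
block_bytes = 4 * 1024

achieved_gbit = iops * block_bytes * 8 / 1e9   # payload moved per second
link_gbit = 20.0                               # measured iperf single stream

print(f"payload bandwidth at 40k IOPS: {achieved_gbit:.2f} Gbit/s")
print(f"fraction of the 20 Gbit/s link: {achieved_gbit / link_gbit:.1%}")
```

About 1.3 Gbit/s, i.e. the network is ~93% idle during that benchmark - the
time is being spent in software, not on the wire.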
>
> Doing the math for 40 Gbit: transferring 4 KiB of data over the network
> takes about 4*1024*8/40000000000 s = ~0.82 us; let's say 2 us because of
> overhead, multiplied by 3 round trips because of the replication - the
> latency should still easily stay below 50 us.
> I have also read a lot about slow flushing disks, but there is definitely
> something going on in the software when it can't even handle in-memory
> disks without taking milliseconds.
>
> All servers were running Ubuntu 21.10 and "ceph version 16.2.6
> (ee28fb57e47e9f88813e24bbf4c14496ca299d31) pacific (stable)".
>
> What did I do wrong? What's going on here?!
> Please help me out!
> _______________________________________________
> ceph-users mailing list -- ceph-users@xxxxxxx
> To unsubscribe send an email to ceph-users-leave@xxxxxxx