Hi, yes, it has software bottlenecks :-) https://yourcmc.ru/wiki/Ceph_performance

If you just need block storage, try Vitastor https://vitastor.io/
https://yourcmc.ru/git/vitalif/vitastor/src/branch/master/README.md - I made
it architecturally very similar to Ceph. Or, if you're fine with an even
simpler DRBD-based design (no distribution of every image across the whole
cluster), there is also Linstor.

Ceph is designed for multiple use cases at once; its design tradeoffs are
fine for S3 (cold/warm) object storage, but high-performance SSD-based block
storage is a different thing :-)

> Hello,
>
> I'm currently in the process of setting up a production Ceph cluster on a
> 40 Gbit network (40 Gbit for both the cluster and public networks).
>
> I have already done a lot of machine/Linux tweaking:
>
> - cpupower: C-states disabled
> - low-latency kernel
> - kernel tweaks
> - RX buffer optimization
> - affinity mappings
> - correct PCIe bus mapping of the cards
> - MTU...
> + many more
>
> My machines are connected over a switch capable of multiple Tbit/s.
>
> iperf result between two machines:
> single connection: ~20 Gbit/s (reaching single-core limits)
> multiple connections: ~39 Gbit/s
>
> A perfect starting point, I would say. Let's assume we only reach
> 20 Gbit/s as the network speed.
>
> Now I wanted to check how much overhead Ceph has and what's the absolute
> maximum I could get out of my cluster.
>
> I'm using 8 servers (16 cores + HT each), all with the same CPUs/RAM/
> network cards, and plenty of RAM for my tests. A simple pool with 3x
> replication.
> To rule out the disks and prove that my servers are fast enough, I
> created 70 GB RAM drives on all of them.
>
> The RAM drives were created using the kernel module "brd".
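A quick back-of-the-envelope on what those iperf numbers allow for a single
replicated 4 KiB write - just a sketch; the 5 us per-hop round trip is my
assumption for a tuned 40GbE setup, not something from your mail:

```python
# Rough network-only floor for one replicated 4 KiB write.
# Assumptions (mine, not measured): 5 us per-hop round trip, serialization
# at the measured single-stream 20 Gbit/s, and the primary OSD fanning out
# to its 2 replicas in parallel (so 2 sequential network stages).

BLOCK_BITS = 4 * 1024 * 8   # 4 KiB payload in bits
LINK_BPS = 20e9             # measured single-connection iperf speed
RTT_US = 5.0                # assumed per-hop round trip, microseconds

wire_us = BLOCK_BITS / LINK_BPS * 1e6   # serialization delay per hop
# client -> primary, then primary -> replicas in parallel:
total_us = 2 * (wire_us + RTT_US)

print(f"wire time per hop:   {wire_us:.2f} us")
print(f"network-only floor: {total_us:.2f} us")
```

So even with generous per-hop overhead, the network alone explains well
under 50 us - nowhere near milliseconds.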
> Benchmarking the RAM drives with fio gave the following results:
>
> 5M IOPS read @ 4k
> 4M IOPS write @ 4k
>
> read and write | latency below 10 us @ QD=16
> read and write | latency below 50 us @ QD=256
>
> 1.5 GB/s sequential read @ 4k (QD=1)
> 1.0 GB/s sequential write @ 4k (QD=1)
>
> 15 GB/s read @ 4k (QD=256)
> 10 GB/s write @ 4k (QD=256)
>
> Pretty impressive - disks we are all dreaming about, I would say.
>
> To make sure I don't bottleneck anything, I created the following setup:
>
> - 3 servers, each running a mon, mgr and mds (all running in RAM,
>   including their dirs in /var/lib/ceph, via tmpfs/brd ramdisk)
> - 4 servers mapping their RAM drives as OSDs, created with BlueStore
>   using ceph-volume raw or lvm
> - 1 server using rbd-nbd to map one RBD image as a drive and benchmark it
>
> In this scenario I would expect impressive results; the only bottleneck
> between the servers is the 20 Gbit network speed. Everything else runs
> completely in low-latency ECC memory and should be blazing fast until the
> network speed is reached.
>
> The benchmark was monitored with this tool:
> https://github.com/ceph/ceph/blob/master/src/tools/histogram_dump.py
> and by looking at the raw output of "ceph daemon osd.7 perf dump".
>
> *Result:*
> Either something is going really wrong, or Ceph has huge bottlenecks
> inside the software which should be solved...
>
> histogram_dump often showed latency spikes between "1M and 51M" - which,
> if I read it correctly, is seconds?!
> How is that possible? This should always be between 0-99k.
>
> The perf dump results were also crazy slow:
> https://pastebin.com/ukV0LXWH
>
> Especially these KPIs:
> - op_latency
> - op_w_latency
> - op_w_prepare_latency
> - state_deferred_cleanup_lat
>
> All in the range of milliseconds?
>
> fio against the nbd drive reached 40k IOPS @ 4k with QD=256, with
> latency in the 1-7 millisecond range.
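To put that last fio number in perspective - plain arithmetic on the figures
quoted above, nothing assumed beyond them:

```python
# The measured 40k IOPS at 4 KiB, expressed as payload bandwidth, versus
# the single-stream network bandwidth from the iperf test above.

iops = 40_000
block_bytes = 4 * 1024

achieved_gbit = iops * block_bytes * 8 / 1e9   # payload moved per second
link_gbit = 20.0                               # measured iperf single stream

print(f"payload bandwidth at 40k IOPS: {achieved_gbit:.2f} Gbit/s")
print(f"fraction of the 20 Gbit/s link: {achieved_gbit / link_gbit:.1%}")
```

About 1.3 Gbit/s, i.e. the network is ~93% idle during that benchmark - the
time is being spent in software, not on the wire.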
>
> Doing the math for 40 Gbit: transferring 4 KiB of data over the network
> takes about 4*1024*8/40000000000 s = ~0.82 us; let's say 2 us because of
> overhead, multiplied by 3 round trips because of the replication - the
> latency should still easily stay below 50 us.
> I have also read a lot about slow flushing disks, but there is definitely
> something going on in the software when it can't even handle in-memory
> disks without taking milliseconds.
>
> All servers were running Ubuntu 21.10 and "ceph version 16.2.6
> (ee28fb57e47e9f88813e24bbf4c14496ca299d31) pacific (stable)".
>
> What did I do wrong? What's going on here?!
> Please help me out!
> _______________________________________________
> ceph-users mailing list -- ceph-users@xxxxxxx
> To unsubscribe send an email to ceph-users-leave@xxxxxxx