Hello,

I'm currently in the process of setting up a production Ceph cluster on a 40 Gbit network (40 Gbit for both the cluster/internal and the public network). I have already done a lot of machine/Linux tuning:

- cpupower C-states disabled
- lowlatency kernel
- kernel tweaks
- RX buffer tuning
- IRQ affinity mappings
- correct PCIe bus placement of the NICs
- MTU tuning
+ many more

My machines are connected through a switch capable of multiple Tbit/s. iperf between two machines gives:

- single connection: ~20 Gbit/s (hitting the single-core limit)
- multiple connections: ~39 Gbit/s

A perfect starting point, I would say. Let's conservatively assume 20 Gbit/s as the usable network speed.

Now I wanted to check how much overhead Ceph adds and what the absolute maximum is that I could get out of my cluster. I am using 8 servers (16 cores + HT each, all with identical CPUs/RAM/network cards), each with plenty of RAM, which I am using for these tests, and a simple pool with 3x replication.

To take the disks out of the equation and prove that the servers themselves are fast, I created 70 GB RAM drives on all of them using the kernel module "brd". Benchmarking the RAM drives with fio gave the following results:

- 5M IOPS random read @ 4k
- 4M IOPS random write @ 4k
- read and write latency below 10 us @ QD=16
- read and write latency below 50 us @ QD=256
- 1.5 GB/s sequential read @ 4k (QD=1)
- 1.0 GB/s sequential write @ 4k (QD=1)
- 15 GB/s read @ 4k (QD=256)
- 10 GB/s write @ 4k (QD=256)

Pretty impressive, disks we are all dreaming about, I would say.

To make sure I don't bottleneck anything, I created the following setup (rough command sketches are appended at the end of this mail):

- 3 servers, each running a mon, mgr and mds (all running entirely in RAM, including their directories under /var/lib/ceph, via tmpfs/brd ramdisk)
- 4 servers exposing their RAM drive as an OSD, created with BlueStore via ceph-volume raw or lvm
- 1 server using rbd-nbd to map one RBD image as a block device and benchmark it

In this scenario I would expect impressive results: the only bottleneck between the servers is the 20 Gbit/s network. Everything else runs completely in low-latency ECC memory and should be blazing fast until the network limit is reached.

The benchmark was monitored with this tool:
https://github.com/ceph/ceph/blob/master/src/tools/histogram_dump.py
and by looking at the raw data from "ceph daemon osd.7 perf dump".

*Result:*

Either something is going really wrong here, or Ceph has huge internal bottlenecks that should be solved...

histogram_dump frequently showed latency spikes in the "1M to 51M" buckets, which, if I am reading it correctly, is seconds?! How is that possible? This should always stay in the 0-99k range.

The perf dump numbers were also crazy slow: https://pastebin.com/ukV0LXWH
Especially these counters:

- op_latency
- op_w_latency
- op_w_prepare_latency
- state_deferred_cleanup_lat

All of them in the millisecond range?

fio against the nbd device reached 40k IOPS @ 4k with QD=256, at latencies of 1-7 milliseconds.

A quick back-of-the-envelope calculation: pushing 4 KiB through a 40 Gbit link takes about 4096 B * 8 bit/B / 40e9 bit/s ~= 0.8 us; let's call it 2 us with overhead, multiplied by 3 round trips because of the replication. That should still easily end up below 50 us of latency.

I have also read a lot about slow flushing disks, but there is definitely something going on in the software when it cannot even handle in-memory disks without taking milliseconds.

All servers were running Ubuntu 21.10 and "ceph version 16.2.6 (ee28fb57e47e9f88813e24bbf4c14496ca299d31) pacific (stable)".

What did I do wrong? What is going on here?! Please help me out!
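For reference, this is roughly how the RAM drives were created and baselined; the size, device name and fio parameters here are illustrative, not copied from my shell history:

# Create one 70 GB RAM block device via the brd kernel module
# (rd_size is given in KiB: 70 * 1024 * 1024 = 73400320)
modprobe brd rd_nr=1 rd_size=73400320 max_part=0
lsblk /dev/ram0

# 4k random-write baseline against the raw RAM drive, QD=16
fio --name=ramdrive-randwrite --filename=/dev/ram0 \
    --ioengine=libaio --direct=1 --rw=randwrite --bs=4k \
    --iodepth=16 --numjobs=1 --runtime=60 --time_based \
    --group_reporting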
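The OSDs and the benchmark image were set up roughly like this (pool/image names and sizes are placeholders; I used ceph-volume raw on some nodes and lvm on others, the lvm variant is shown here):

# On each of the 4 OSD servers: turn the RAM drive into a BlueStore OSD
ceph-volume lvm create --bluestore --data /dev/ram0

# On the benchmark server: 3x replicated pool plus a test image
ceph osd pool create rbdbench 128 128 replicated
ceph osd pool set rbdbench size 3
rbd pool init rbdbench
rbd create rbdbench/bench --size 50G

# Map the image via rbd-nbd and run the same 4k fio job against it
rbd-nbd map rbdbench/bench
fio --name=rbd-nbd-randwrite --filename=/dev/nbd0 \
    --ioengine=libaio --direct=1 --rw=randwrite --bs=4k \
    --iodepth=256 --numjobs=1 --runtime=60 --time_based \
    --group_reporting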
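And this is roughly how I watched the OSD-side latencies while fio was running; osd.7 is just one example daemon, and the jq paths assume the usual section names from the perf dump output linked above:

# Raw per-op latency histograms (what histogram_dump.py visualizes)
ceph daemon osd.7 perf histogram dump

# Selected latency counters from the regular perf counters
ceph daemon osd.7 perf dump | jq '.osd.op_latency, .osd.op_w_latency, .osd.op_w_prepare_latency'
ceph daemon osd.7 perf dump | jq '.bluestore.state_deferred_cleanup_lat'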
_______________________________________________ ceph-users mailing list -- ceph-users@xxxxxxx To unsubscribe send an email to ceph-users-leave@xxxxxxx