There's a lot going on here. Some things I noticed you should be aware of in
relation to the tests you performed:

* Ceph may not have the performance ceiling you're looking for. A write IO
takes about half a millisecond of CPU time, which used to be very fast and is
now pretty slow compared to an NVMe device. Crimson will reduce this but is
not ready for real users yet. Of course, it scales out quite well, which your
test is not going to explore with a single client and 4 OSDs.

* If you are seeing reported 4k IO latencies measured in seconds, something
has gone horribly wrong. Ceph doesn't do that unless you simply queue up so
much work it can't keep up (and it tries to prevent you from doing that).

* I don't know what the *current* numbers are, but in the not-distant past,
40k IOPS was about as much as a single rbd device could handle on the client
side. So whatever else is happening, there's a good chance that's the limit
you're actually exposing in this test.

* Your RAM disks may not be as fast as you think they are under a non-trivial
load. Network IO, the moving of data between kernel and userspace that Ceph
has to do and local fio doesn't, etc. will all take up roughly equivalent
portions of that 10 GB/s bandwidth you saw and split up the streams, which
may slow it down. And once your CPU has to do anything else, it will feed the
RAM less quickly because it's busy doing other things. (Memory bandwidth is a
*really* complex topic.)

There are definitely proprietary distributed storage systems that can go
faster than Ceph, and there may be open-source ones, but most of them don't
provide the durability and consistency guarantees you'd expect under a lot of
failure scenarios.

-Greg

On Sat, Jan 29, 2022 at 8:42 PM sascha a. <sascha.arthur@xxxxxxxxx> wrote:
>
> Hello,
>
> I'm currently in the process of setting up a production Ceph cluster on a
> 40 Gbit network (40 Gbit for both the internal and the public network).
>
> I already did a lot of machine/Linux tweaking:
>
> - cpupower: idle states disabled
> - low-latency kernel
> - kernel tuning
> - RX buffer optimization
> - affinity mappings
> - correct bus mapping of the PCIe cards
> - MTU
> + many more
>
> My machines are connected over a switch which is capable of switching
> multiple Tbit/s.
>
> iperf result between two machines:
> single connection: ~20 Gbit/s (reaching single-core limits)
> multiple connections: ~39 Gbit/s
>
> A perfect starting point, I would say. Let's assume we only reach 20 Gbit/s
> as network speed.
>
> Now I wanted to check how much overhead Ceph has and what the absolute
> maximum is that I could get out of my cluster.
>
> I'm using 8 servers (16 cores + HT), all with the same CPUs/RAM/network
> cards, with plenty of RAM which I'm using for my tests, and a simple pool
> with 3x replication.
> For that reason, and to prove that my servers have high quality and speed,
> I created 70 GB RAM-drives on all of them.
>
> The RAM-drives were created using the kernel module "brd" (a sketch of the
> commands is included below). Benchmarking the RAM-drives with fio gave the
> following results:
>
> 5M IO/s read @ 4k
> 4M IO/s write @ 4k
>
> read and write | latency below 10 us @ QD=16
> read and write | latency below 50 us @ QD=256
>
> 1.5 GB/s sequential read @ 4k (QD=1)
> 1.0 GB/s sequential write @ 4k (QD=1)
>
> 15 GB/s read @ 4k (QD=256)
> 10 GB/s write @ 4k (QD=256)
>
> Pretty impressive, disks we are all dreaming about, I would say.
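>
> For reference, a minimal sketch of how such a brd RAM-drive can be created
> and benchmarked; the exact size and fio parameters here are illustrative,
> not the precise values behind the numbers above:
>
>   # load brd with a single ~70 GiB RAM disk (rd_size is in KiB);
>   # it shows up as /dev/ram0
>   modprobe brd rd_nr=1 rd_size=73400320
>
>   # 4k random-write benchmark against the RAM disk
>   fio --name=ramtest --filename=/dev/ram0 --ioengine=libaio --direct=1 \
>       --rw=randwrite --bs=4k --iodepth=16 --numjobs=1 \
>       --time_based --runtime=30 --group_reporting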
>
> To make sure I don't bottleneck anything, I created the following setup:
>
> - 3 servers, each running a mon, mgr, and MDS (all running in RAM,
> including their dirs in /var/lib/ceph.., by using tmpfs/brd ramdisks)
> - 4 servers mapping their RAM-drive as an OSD, created with BlueStore using
> ceph-volume raw or lvm
> - 1 server using rbd-nbd to map one RBD image as a drive and to benchmark
> it (a sketch of this part is at the end of this mail)
>
> In this scenario I would expect impressive results; the only bottleneck
> between the servers is the 20 Gbit/s network speed. Everything else runs
> completely in low-latency ECC memory and should be blazing fast until the
> network speed is reached.
>
> The benchmark was monitored using this tool:
> https://github.com/ceph/ceph/blob/master/src/tools/histogram_dump.py
> and by looking at the raw data of "ceph daemon osd.7 perf dump".
>
>
> *Result:*
> Either something is going really wrong here or Ceph has huge bottlenecks
> inside the software which should be solved...
>
> histogram_dump often showed latency spikes between "1M and 51M", which, if
> I read it correctly, is seconds?!
> How is that possible? This should always be between 0 and 99k.
>
> The result of perf dump was also crazy slow:
> https://pastebin.com/ukV0LXWH
>
> Especially these KPIs:
> - op_latency
> - op_w_latency
> - op_w_prepare_latency
> - state_deferred_cleanup_lat
>
> all in the range of milliseconds?
>
> fio against the nbd drive reached 40k IOPS @ 4k with QD=256, at latencies
> of 1-7 milliseconds.
>
> Calculating with 40 Gbit/s, a 4k block spends about
> 4*1024*8 / 40,000,000,000 s = ~0.8 us on the wire; let's say 2 us because
> of overhead, multiplied by 3 round trips because of the replication...
> that should still easily reach latencies below 50 us.
> I have also read a lot about slow flushing disks, but there's definitely
> something going on in the software when it's not even capable of handling
> in-memory disks without taking milliseconds.
>
>
> All servers were running Ubuntu 21.10 and "ceph version 16.2.6
> (ee28fb57e47e9f88813e24bbf4c14496ca299d31) pacific (stable)".
>
>
> What did I do wrong? What's going on here?!
> Please help me out!
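>
> For completeness, a rough sketch of the OSD and client side of this test,
> using the ceph-volume lvm variant mentioned above; the pool and image names
> are made up, and the fio line only mirrors the parameters quoted earlier:
>
>   # on each of the 4 OSD hosts: turn the brd RAM-drive into a BlueStore OSD
>   ceph-volume lvm create --bluestore --data /dev/ram0
>
>   # on the client: create a pool and an image, then map it via rbd-nbd
>   ceph osd pool create rbdbench 128
>   rbd pool init rbdbench
>   rbd create rbdbench/test --size 50G
>   rbd-nbd map rbdbench/test        # prints the device, e.g. /dev/nbd0
>
>   # 4k random-write benchmark at QD=256 against the mapped device
>   fio --name=rbdtest --filename=/dev/nbd0 --ioengine=libaio --direct=1 \
>       --rw=randwrite --bs=4k --iodepth=256 --time_based --runtime=60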