Hey Vitalif,

I had already come across your wiki as well as your own software. Pretty
impressive, and I love your work! I especially like your "Theoretical
Maximum Random Access Performance" section; that is exactly what I would
expect of Ceph's performance as well (Ceph being, by design, very close to
your Vitastor). I was hoping to get your attention on this topic, because
you seem to be a performance/perfection-driven guy, like me. Most people
just accept the way software works and take bad performance for granted.
The usual excuse is "cheap hardware", which I just eliminated by using RAM
disks.

Anyway, did you ever test a Ceph cluster against RAM disks? Even after
consuming your whole wiki and applying my 20+ years of Unix/programming
knowledge, this tells me that the NVMe drives used (with or without
capacitors) don't matter at all when they are utilized as badly as it
currently appears. I can hardly believe the software really behaves this
way - don't the devs ever test performance? There must be serious
synchronization/single-threaded bottlenecks or missed events for it to
behave THAT badly even with in-memory OSDs. On the positive side, this
means a few targeted code changes (threading, async waits, queues) should
have a huge impact on performance.

I'm currently compiling Ceph with Crimson (the build process also seems a
bit messy) to check what performance to expect from the new SeaStore
backend. I will check Linstor as well, thanks!

On Sun, Jan 30, 2022 at 6:37 PM <vitalif@xxxxxxxxxx> wrote:

> Hi, yes, it has software bottlenecks :-)
>
> https://yourcmc.ru/wiki/Ceph_performance
>
> If you just need block storage - try Vitastor https://vitastor.io/
> https://yourcmc.ru/git/vitalif/vitastor/src/branch/master/README.md -
> I made it very architecturally similar to Ceph - or, if you're fine
> with an even simpler DRBD-based design (no distribution of every image
> across the whole cluster), even Linstor.
>
> Ceph is designed for multiple use cases at once; its design tradeoffs
> are fine for S3 (cold/warm) object storage, but high-performance
> SSD-based block storage is a different thing :-)
>
> > Hello,
> >
> > I'm currently in the process of setting up a production Ceph cluster
> > on a 40 Gbit network (40 Gbit for both the cluster and the public
> > network).
> >
> > I have already done a lot of machine/Linux tuning:
> >
> > - cpupower idle states disabled
> > - low-latency kernel
> > - kernel tunables
> > - RX buffer tuning
> > - IRQ/CPU affinity mappings
> > - correct bus mapping of the PCIe cards
> > - MTU
> > + many more
> >
> > My machines are connected through a switch capable of multiple
> > Tbit/s.
> >
> > iperf result between two machines:
> > single connection ~20 Gbit/s (reaching single-core limits)
> > multiple connections ~39 Gbit/s
> >
> > A perfect starting point, I would say. Let's assume we only reach
> > 20 Gbit/s of network speed.
> >
> > Now I wanted to check how much overhead Ceph has and what the
> > absolute maximum is that I can get out of my cluster.
> >
> > I am using 8 servers (16 cores + HT), all with identical CPUs, RAM
> > and network cards, and plenty of RAM, which I am using for my tests.
> > A simple pool with 3x replication.
> > For that reason, and to prove that my servers are fast enough, I
> > created 70 GB RAM drives on all of them.
> >
> > The RAM drives were created using the kernel module "brd".
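(For reference: a brd ramdisk of that size and a matching fio run can be
recreated roughly as sketched below. The device name, runtime and fio job
parameters are illustrative, not the exact ones used in the test.)

    # load brd with a single ramdisk; rd_size is in KiB, 73400320 KiB ~= 70 GB
    modprobe brd rd_nr=1 rd_size=73400320

    # 4k random-write benchmark against the ramdisk at QD=16
    fio --name=ramdisk-4k-randwrite --filename=/dev/ram0 \
        --ioengine=libaio --direct=1 --rw=randwrite --bs=4k \
        --iodepth=16 --numjobs=1 --runtime=30 --time_based \
        --group_reporting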
> > Benchmarking the RAM drives with fio gave the following results:
> >
> > 5M IOPS read @ 4k
> > 4M IOPS write @ 4k
> >
> > read and write | latency below 10 us @ QD=16
> > read and write | latency below 50 us @ QD=256
> >
> > 1.5 GB/s sequential read @ 4k (QD=1)
> > 1.0 GB/s sequential write @ 4k (QD=1)
> >
> > 15 GB/s read @ 4k (QD=256)
> > 10 GB/s write @ 4k (QD=256)
> >
> > Pretty impressive - disks we are all dreaming about, I would say.
> >
> > To make sure I don't bottleneck anything, I created the following
> > setup:
> >
> > - 3 servers, each running a mon, mgr and mds (all running in RAM,
> >   including their directories in /var/lib/ceph, via tmpfs/brd
> >   ramdisk)
> > - 4 servers exposing their RAM drive as an OSD, created with
> >   BlueStore via ceph-volume raw or lvm
> > - 1 server using rbd-nbd to map one RBD image as a drive and to
> >   benchmark it
> >
> > In this scenario I would expect impressive results; the only
> > bottleneck between the servers is the 20 Gbit network speed.
> > Everything else runs entirely in low-latency ECC memory and should
> > be blazing fast until the network limit is reached.
> >
> > The benchmark was monitored with this tool:
> > https://github.com/ceph/ceph/blob/master/src/tools/histogram_dump.py
> > and by looking at the raw data from "ceph daemon osd.7 perf dump".
> >
> > *Result:*
> > Either something is going really wrong, or Ceph has huge bottlenecks
> > inside the software which should be solved...
> >
> > histogram_dump frequently showed latency spikes between "1M and
> > 51M", which - if I read it correctly - is seconds?!
> > How is that possible? This should always stay between 0 and 99k.
> >
> > The perf dump results were also crazy slow:
> > https://pastebin.com/ukV0LXWH
> >
> > especially these KPIs:
> > - op_latency
> > - op_w_latency
> > - op_w_prepare_latency
> > - state_deferred_cleanup_lat
> >
> > all in the range of milliseconds?
> >
> > Against the nbd drive, fio reached 40k IOPS @ 4k with QD=256, with
> > latencies in the range of 1-7 milliseconds.
> >
> > At 40 Gbit/s, pushing 4k of data through the network takes about
> > 4*1024*8 / 40e9 = ~0.8 us; let's call it 2 us with overhead,
> > multiplied by 3 round trips because of the replication... that
> > should still easily end up below 50 us of latency.
> > I have also read a lot about slow-flushing disks, but there is
> > definitely something going on in the software when it can't even
> > handle in-memory disks without taking milliseconds.
> >
> > All servers were running Ubuntu 21.10 and "ceph version 16.2.6
> > (ee28fb57e47e9f88813e24bbf4c14496ca299d31) pacific (stable)".
> >
> > What did I do wrong? What's going on here?!
> > Please help me out!
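PS: for anyone who wants to reproduce the rbd-nbd numbers from the quoted
mail, a minimal sequence looks roughly like the sketch below. Pool name,
image name, size and PG counts are placeholders, not the values from the
original test.

    # create a replicated test pool and an RBD image in it
    ceph osd pool create testpool 128 128 replicated
    rbd pool init testpool
    rbd create --size 10G testpool/testimg

    # map the image via rbd-nbd; this prints the device, e.g. /dev/nbd0
    rbd-nbd map testpool/testimg

    # 4k random-write benchmark at QD=256 against the mapped device
    fio --name=rbd-nbd-4k-randwrite --filename=/dev/nbd0 \
        --ioengine=libaio --direct=1 --rw=randwrite --bs=4k \
        --iodepth=256 --numjobs=1 --runtime=60 --time_based \
        --group_reporting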