Yes, I also tried it on RAM disks and got basically the same results as with
NVMes :-) Capacitors (power loss protection) do matter, though: with them you
get about 1000 T1Q1 IOPS, without them only 100-200, and at that point it
starts to resemble an HDD :-)

From my profiling experiments I think there's no easy way to make BlueStore
faster. I'm also interested in Crimson benchmarks, so please post them here
after testing :-)

On 30 January 2022 21:23:55 GMT+03:00, "sascha a." <sascha.arthur@xxxxxxxxx> wrote:
>Hey Vitalif,
>
>I had already found your wiki as well as your own software. Pretty
>impressive, and I love your work!
>I especially like your "Theoretical Maximum Random Access Performance"
>section. That is exactly what I would expect of Ceph's performance as
>well (which by design is very close to your Vitastor).
>
>I hoped to get your attention on this topic, because it seems you're a
>performance/perfection-driven guy, like me. Most people just accept the
>way software works and take bad performance for granted. The usual
>excuse is "cheap hardware", which I just eliminated by using RAM disks.
>
>Anyway, did you ever test a Ceph cluster against RAM disks?
>Even after consuming all of your wiki and drawing on my 20+ years of
>Unix/programming experience, this tells me that the NVMe used (with or
>without capacitors) doesn't matter at all when it is utilized as badly
>as it currently appears to be. I can hardly believe the software really
>behaves like this... don't the devs ever test performance?
>
>There must be serious synchronization issues, single-threaded
>bottlenecks or missed events for it to behave THAT badly even with
>in-memory OSDs. On the positive side this means that a few code tweaks
>(threading, async waits, queues) could have a huge impact on
>performance.
>
>Currently I'm compiling Ceph with Crimson (the compile process also
>seems a bit messy) to check what performance looks like with the new
>SeaStore backend.
>
>I will check LINSTOR as well, thanks!
>
>
>
>On Sun, Jan 30, 2022 at 6:37 PM <vitalif@xxxxxxxxxx> wrote:
>
>> Hi, yes, it has software bottlenecks :-)
>>
>> https://yourcmc.ru/wiki/Ceph_performance
>>
>> If you just need block storage, try Vitastor: https://vitastor.io/
>> https://yourcmc.ru/git/vitalif/vitastor/src/branch/master/README.md
>> (I made it architecturally very similar to Ceph), or, if you're fine
>> with an even simpler DRBD-based design (no distribution of every
>> image across the whole cluster), even LINSTOR.
>>
>> Ceph is designed for multiple use cases at once; its design tradeoffs
>> are fine for S3 (cold/warm) object storage, but high-performance
>> SSD-based block storage is a different thing :-)
>>
>> > Hello,
>> >
>> > I'm currently in the process of setting up a production Ceph
>> > cluster on a 40 Gbit network (both the internal/cluster and the
>> > public network are 40 Gbit).
>> >
>> > I have already done a lot of machine/Linux tuning:
>> >
>> > - disabled CPU power-saving states (cpupower)
>> > - low-latency kernel
>> > - kernel tunables
>> > - RX buffer sizes
>> > - IRQ/CPU affinity mappings
>> > - correct PCIe bus mapping of the NICs
>> > - MTU
>> > + many more
>> >
>> > My machines are connected through a switch capable of multiple
>> > Tbit/s.
>> >
>> > iperf results between two machines:
>> > single connection ~20 Gbit/s (reaching single-core limits)
>> > multiple connections ~39 Gbit/s
>> >
>> > A perfect starting point, I would say. Let's conservatively assume
>> > we only reach 20 Gbit/s of network throughput.
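>> >
>> > (The iperf numbers above can be reproduced roughly as in the sketch
>> > below; iperf3 and the hostname "node2" are assumptions, not the
>> > exact invocation used.)
>> >
>> >   # on the receiving node
>> >   iperf3 -s
>> >   # single TCP stream: ~20 Gbit/s, limited by one CPU core
>> >   iperf3 -c node2 -t 30
>> >   # 8 parallel streams: ~39 Gbit/s, close to line rate
>> >   iperf3 -c node2 -t 30 -P 8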
>> >
>> > Now I wanted to check how much overhead Ceph has and what the
>> > absolute maximum is that I can get out of my cluster.
>> >
>> > I'm using 8 servers (16 cores + HT, all with the same
>> > CPUs/RAM/network cards), each with plenty of RAM for my tests, and
>> > a simple pool with 3x replication. To rule out slow disks and to
>> > prove that my servers are fast, I created 70 GB RAM drives on all
>> > of them.
>> >
>> > The RAM drives were created with the kernel module "brd".
>> > Benchmarking the RAM drives with fio gave the following results:
>> >
>> > 5M IOPS read @ 4k
>> > 4M IOPS write @ 4k
>> >
>> > read and write | latency below 10 us @ QD=16
>> > read and write | latency below 50 us @ QD=256
>> >
>> > 1.5 GB/s sequential read @ 4k (QD=1)
>> > 1.0 GB/s sequential write @ 4k (QD=1)
>> >
>> > 15 GB/s read @ 4k (QD=256)
>> > 10 GB/s write @ 4k (QD=256)
>> >
>> > Pretty impressive, disks we are all dreaming of, I would say.
>> >
>> > To make sure I don't bottleneck anything, I created the following
>> > setup:
>> >
>> > - 3 servers, each running a mon, mgr and mds (all running in RAM,
>> >   including their dirs in /var/lib/ceph, via tmpfs/brd ramdisk)
>> > - 4 servers exposing their RAM drive as an OSD, created as
>> >   BlueStore with ceph-volume raw or lvm
>> > - 1 server using rbd-nbd to map one RBD image as a block device and
>> >   benchmark it
>> >
>> > In this scenario I would expect impressive results; the only
>> > bottleneck between the servers is the 20 Gbit/s network. Everything
>> > else runs entirely in low-latency ECC memory and should be blazing
>> > fast until the network limit is reached.
>> >
>> > The benchmark was monitored with this tool:
>> > https://github.com/ceph/ceph/blob/master/src/tools/histogram_dump.py
>> > and by looking at the raw data from "ceph daemon osd.7 perf dump".
>> >
>> > *Result:*
>> > Either something is going really wrong, or Ceph has huge software
>> > bottlenecks that need to be solved...
>> >
>> > histogram_dump often showed latency spikes between "1M and 51M",
>> > which, if I read it correctly, means seconds?!
>> > How is that possible? It should always stay in the 0-99k range.
>> >
>> > The perf dump output was also crazy slow:
>> > https://pastebin.com/ukV0LXWH
>> >
>> > Especially these KPIs:
>> > - op_latency
>> > - op_w_latency
>> > - op_w_prepare_latency
>> > - state_deferred_cleanup_lat
>> >
>> > all in the range of milliseconds?
>> >
>> > fio against the nbd device reached 40k IOPS @ 4k with QD=256, at
>> > latencies of 1-7 milliseconds...
>> >
>> > At 40 Gbit/s, 4 KiB of data takes about 4*1024*8 / 40,000,000,000 s,
>> > i.e. roughly 0.8 us, to cross the network; call it 2 us with
>> > protocol overhead, multiplied by 3 round trips because of the
>> > replication... the latency should still easily stay below 50 us.
>> > I have also read a lot about slow-flushing disks, but there is
>> > definitely something going on in the software when it can't even
>> > drive in-memory disks without taking milliseconds...
>> >
>> > All servers were running Ubuntu 21.10 and "ceph version 16.2.6
>> > (ee28fb57e47e9f88813e24bbf4c14496ca299d31) pacific (stable)".
>> >
>> > What did I do wrong? What's going on here?!
>> > Please help me out!
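>> >
>> > (For completeness, the OSD/ramdisk part of this setup boils down to
>> > roughly the sketch below. The 70 GiB ramdisk size matches the
>> > description above; the pool name "rbdbench", the PG count, the
>> > image name/size and the fio runtime are placeholders, not the exact
>> > values used.)
>> >
>> >   # on each of the 4 OSD servers: create a 70 GiB brd ramdisk (/dev/ram0)
>> >   modprobe brd rd_nr=1 rd_size=73400320
>> >   # turn the ramdisk into a BlueStore OSD (the lvm path; "raw" also works)
>> >   ceph-volume lvm create --data /dev/ram0
>> >
>> >   # on one node: create a 3x replicated pool and a test image
>> >   ceph osd pool create rbdbench 128 128 replicated
>> >   ceph osd pool set rbdbench size 3
>> >   rbd pool init rbdbench
>> >   rbd create rbdbench/bench --size 50G
>> >
>> >   # on the benchmark server: map the image via rbd-nbd and run fio
>> >   rbd-nbd map rbdbench/bench
>> >   fio --name=test --filename=/dev/nbd0 --ioengine=libaio --direct=1 \
>> >       --rw=randwrite --bs=4k --iodepth=256 --numjobs=1 \
>> >       --time_based --runtime=60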
--
With best regards,
  Vitaliy Filippov

_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx