Hello Gregory,

Thanks for your input.

> * Ceph may not have the performance ceiling you're looking for. A
> write IO takes about half a millisecond of CPU time, which used to be
> very fast and is now pretty slow compared to an NVMe device. Crimson
> will reduce this but is not ready for real users yet. Of course, it
> scales out quite well, which your test is not going to explore with a
> single client and 4 OSDs.

It's impressive that it really takes ~0.5 ms of CPU time, thanks for confirming this.

> * I don't know what the *current* numbers are, but in the not-distant
> past, 40k IOPs was about as much as a single rbd device could handle
> on the client side. So whatever else is happening, there's a good
> chance that's the limit you're actually exposing in this test.

Looks like that's really the case, because using real NVMe devices instead of memory disks does not make much of a difference.

> * Your ram disks may not be as fast as you think they are under a
> non-trivial load. Network IO, moving data between kernel and userspace
> that Ceph has to do and local FIO doesn't, etc will all take up
> roughly equivalent portions of that 10GB/s bandwidth you saw and split
> up the streams, which may slow it down. Once your CPU has to do
> anything else, it will be able to feed the RAM less quickly because
> it's doing other things. Etc etc etc (Memory bandwidth is a *really*
> complex topic.)

I'm fully with you; I certainly was not expecting latencies as low as with local fio. But a factor of 20000 was far too much.

I see plenty of tests in the source code about crimson, especially performance tests. Do you know by any chance if crimson works against ceph 16 mon nodes? Or is it mandatory to have the mon nodes running ceph 17?

Also, I got confused about the memstore option: it seems to be ignored and 1 GB OSDs are always created. Can you confirm this?
This happens even though the documentation at https://docs.ceph.com/en/latest/dev/crimson/crimson/ says that it should support --memory.

Thanks!

On Mon, Jan 31, 2022 at 6:06 PM Gregory Farnum <gfarnum@xxxxxxxxxx> wrote:

> There's a lot going on here. Some things I noticed you should be aware
> of in relation to the tests you performed:
>
> * Ceph may not have the performance ceiling you're looking for. A
> write IO takes about half a millisecond of CPU time, which used to be
> very fast and is now pretty slow compared to an NVMe device. Crimson
> will reduce this but is not ready for real users yet. Of course, it
> scales out quite well, which your test is not going to explore with a
> single client and 4 OSDs.
>
> * If you are seeing reported 4k IO latencies measured in seconds,
> something has gone horribly wrong. Ceph doesn't do that, unless you
> simply queue up so much work it can't keep up (and it tries to prevent
> you from doing that).
>
> * I don't know what the *current* numbers are, but in the not-distant
> past, 40k IOPs was about as much as a single rbd device could handle
> on the client side. So whatever else is happening, there's a good
> chance that's the limit you're actually exposing in this test.
>
> * Your ram disks may not be as fast as you think they are under a
> non-trivial load. Network IO, moving data between kernel and userspace
> that Ceph has to do and local FIO doesn't, etc will all take up
> roughly equivalent portions of that 10GB/s bandwidth you saw and split
> up the streams, which may slow it down. Once your CPU has to do
> anything else, it will be able to feed the RAM less quickly because
> it's doing other things. Etc etc etc (Memory bandwidth is a *really*
> complex topic.)
>
> There are definitely proprietary distributed storage systems that can
> go faster than Ceph, and there may be open-source ones, but most of
> them don't provide the durability and consistency guarantees you'd
> expect under a lot of failure scenarios.
> -Greg
>
>
> On Sat, Jan 29, 2022 at 8:42 PM sascha a. <sascha.arthur@xxxxxxxxx> wrote:
> >
> > Hello,
> >
> > I'm currently in the process of setting up a production ceph cluster on a
> > 40 gbit network (for sure 40gb internal and public network).
> >
> > Did a lot of machine/linux tweaking already:
> >
> > - cpupower state disable
> > - lowlatency kernel
> > - kernel tweaking
> > - rx buffer optimize
> > - affinity mappings
> > - correct bus mapping of pcie cards
> > - mtu..
> > + many more
> >
> > My machines are connected over a switch which is capable of doing multiple
> > TBit/s.
> >
> > iperf result between two machines:
> > single connection ~20 Gbit/s (reaching single core limits)
> > multiple connections ~39 Gbit/s
> >
> > Perfect starting point I would say. Let's assume we only reach 20 Gbit/s as
> > network speed.
> >
> > Now I wanted to check how much overhead ceph has and what's the absolute
> > maximum I could get out of my cluster.
> >
> > Using 8 Servers (16 cores + HT), (all the same cpus/ram/network cards).
> > Having plenty of RAM which I'm using for my tests. Simple pool by using 3x
> > replication.
> > For that reason and to prove that my servers have high quality and speed I
> > created 70 GB RAM-drives on all of them.
> >
> > The RAM-drives were created by using the kernel module "brd".
> > Benchmarking the RAM-drives gave the following result by using fio:
> >
> > 5M IO/s read@4k
> > 4M IO/s write@4k
> >
> > read and write | latency below 10 us@QD=16
> > read and write | latency below 50 us@QD=256
> >
> > 1.5 GB/s sequential read@4k (QD=1)
> > 1.0 GB/s sequential write@4k (QD=1)
> >
> > 15 GB/s read@4k (QD=256)
> > 10 GB/s write@4k (QD=256)
> >
> > Pretty impressive, disks we are all dreaming about I would say.
> >
> > To make sure I don't bottleneck anything, I created the following setup:
> >
> > - 3 Servers, each running a mon, mgr and mds (all running in RAM including
> > their dirs in /var/lib/ceph..
> > by using tmpfs/brd ramdisk)
> > - 4 Servers mapping their ramdrive as OSDs, created with bluestore by using
> > ceph-volume raw or lvm.
> > - 1 Server using rbd-nbd to map one rbd as drive and to benchmark it
> >
> > In this scenario I would expect impressive results; the only
> > bottleneck between the servers is the 20 Gbit network speed.
> > Everything else is running completely in low latency ECC memory and should
> > be blazing fast until the network speed is reached.
> >
> > The benchmark was monitored by using this tool here:
> > https://github.com/ceph/ceph/blob/master/src/tools/histogram_dump.py and also
> > by looking at the raw data of "ceph daemon osd.7 perf dump".
> >
> >
> > *Result:*
> > Either there is something really going wrong or ceph has huge bottlenecks
> > inside the software which should be solved...
> >
> > histogram_dump often showed latency spikes between "1M and 51M", ... which
> > is, if I read it correctly, seconds?!
> > How is that possible? This should always be between 0-99k..
> >
> > The result of perf dump was also crazy slow:
> > https://pastebin.com/ukV0LXWH
> >
> > especially the KPIs here:
> > - op_latency
> > - op_w_latency
> > - op_w_prepare_latency
> > - state_deferred_cleanup_lat
> >
> > all in areas of milliseconds?
> >
> > FIO against the nbd drive reached 40k IOPs@4k with QD=256 and a latency
> > of 1-7 milliseconds..
> >
> > Calculating with 40 Gbit/s, 4k of data flows through the network in
> > about 4*1024*8/40000000000 s = ~819 ns, let's say it's 2 us because of the
> > overhead, multiplied by 3 RTT because of the replication... it should still
> > easily reach a latency below 50 us.
> > Also I read a lot about slow flushing disks, but there's definitely something
> > going on in the software when it's not even capable of handling in-memory
> > disks without taking milliseconds..
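
A rough sanity check of the two estimates in the quoted paragraphs above, as a minimal Python sketch. The function names are made up for illustration; only the inputs (4 KiB blocks, 40 and 20 Gbit/s links, QD=256, 40k IOPS) come from the thread. Link speed is in bits per second, so a 4 KiB block is 4096 * 8 = 32768 bits and takes about 819 ns to serialize at 40 Gbit/s. The second function applies Little's law (mean latency = outstanding requests / throughput) and assumes fio keeps the queue full in steady state.

```python
# Back-of-the-envelope checks for the figures quoted in this thread.
# Function names are illustrative and not part of Ceph or fio.

def wire_time_ns(block_bytes: int, link_bits_per_s: float) -> float:
    """Time to serialize one block onto the link, in nanoseconds."""
    return block_bytes * 8 / link_bits_per_s * 1e9


def mean_latency_ms(queue_depth: int, iops: float) -> float:
    """Little's law: mean latency = outstanding requests / throughput."""
    return queue_depth / iops * 1e3


if __name__ == "__main__":
    print(f"4 KiB at 40 Gbit/s: {wire_time_ns(4096, 40e9):.1f} ns")
    print(f"4 KiB at 20 Gbit/s: {wire_time_ns(4096, 20e9):.1f} ns")
    print(f"QD=256 at 40k IOPS: {mean_latency_ms(256, 40_000):.1f} ms")
```

If these assumptions hold, 256 outstanding requests at 40k IOPS imply a mean latency of about 6.4 ms. That would make the 1-7 ms reported by fio a consequence of the ~40k IOPS ceiling at that queue depth rather than a sign of slow storage, and a lower queue depth should show correspondingly lower latency.
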
> >
> >
> > All servers were using ubuntu 21.10 and "ceph version 16.2.6
> > (ee28fb57e47e9f88813e24bbf4c14496ca299d31) pacific (stable)"
> >
> >
> > What did I do wrong? What's going on here?!
> > Please help me out!
> > _______________________________________________
> > ceph-users mailing list -- ceph-users@xxxxxxx
> > To unsubscribe send an email to ceph-users-leave@xxxxxxx