Parallel I/O, by the way, is also awful. I can only reach 36000 write IOPS in a
14-NVMe cluster with size=2 and 1 OSD per NVMe. That is only ~2500 IOPS per
drive... OK, even if I take the 2x replication * 5x write amplification into
account, it's still only ~25000 IOPS per drive. And these drives can push over
300000 IOPS :-(. Thankfully performance isn't too critical in this environment.
But it's very low :).

Regards,
Vitaliy

> Hi Wido
>
> No. My results with Ceph (yeah, I still use it) are the same, and I use
> Threadrippers which have almost 4 GHz clock speed.
>
> The network isn't the main problem. The main problem is a lot of program
> logic written in a complex way, which leads to high CPU usage. See
> https://yourcmc.ru/wiki/Ceph_performance if you haven't already.
>
> I achieve ~7000 QD=1 IOPS with Vitastor just because it's much simpler.
> And I'm gradually progressing feature-wise... :-)
>
> Regards,
> Vitaliy
>
>> (Sending this to the dev list as people there might know about it)
>>
>> Hi,
>>
>> There are many talks and presentations out there about Ceph's
>> performance. Ceph is great when it comes to parallel I/O, large queue
>> depths and many applications sending I/O towards Ceph.
>>
>> One thing where Ceph isn't the fastest is 4k blocks written at Queue
>> Depth 1.
>>
>> Some applications benefit very much from high-performance/low-latency
>> I/O at qd=1, for example single-threaded applications which are writing
>> small files inside a VM running on RBD.
>>
>> With some tuning you can get to ~700us latency for a 4k write with
>> qd=1 (replication, size=3).
>>
>> I benchmark this using fio:
>>
>> $ fio --ioengine=librbd --bs=4k --iodepth=1 --direct=1 .. .. .. ..
>>
>> 700us latency means the result will be about 1500 IOPS (1000 / 0.7)
>>
>> When comparing this to, let's say, a BSD machine running ZFS, that's on
>> the low side. With ZFS+NVMe you'll be able to reach somewhere between
>> 7,000 and 10,000 IOPS; the latency is simply much lower.
>>
>> My benchmarking / test setup for this:
>>
>> - Ceph Nautilus/Octopus (doesn't make a big difference)
>> - 3x SuperMicro 1U with:
>>   - AMD Epyc 7302P 16-core CPU
>>   - 128GB DDR4
>>   - 10x Samsung PM983 3.84TB
>>   - 10Gbit Base-T networking
>>
>> Things to configure/tune:
>>
>> - Pin C-States to 1
>> - Set the CPU governor to performance
>> - Turn off all logging in Ceph (debug_osd, debug_ms, debug_bluestore = 0)
>>
>> Higher clock speeds (new AMD Epyc coming in March!) help to reduce the
>> latency, and going towards 25Gbit/100Gbit might help as well.
>>
>> These are however only very small increments and might reduce the
>> latency by another 15% or so.
>>
>> It doesn't bring us anywhere near the 10k IOPS other applications can do.
>>
>> And I totally understand that replication over a TCP/IP network takes
>> time and thus increases latency.
>>
>> The Crimson project [0] is aiming to lower the latency with many things
>> like DPDK and SPDK, but this is far from finished and production ready.
>>
>> In the meantime, am I overlooking something here? Can we reduce the
>> latency of the current OSDs further?
>>
>> Reaching a ~500us latency would already be great!
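
A quick worked version of the arithmetic in this thread (nothing new here,
just the quoted numbers written out as shell arithmetic; the ~5x write
amplification is the estimate from the top of this mail, not a measured value):

  # At QD=1, IOPS is roughly the inverse of the per-write latency:
  echo $((1000000 / 700))       # ~1428 IOPS at 700us, i.e. the ~1500 above
  echo $((1000000 / 500))       # 2000 IOPS at the hoped-for 500us

  # Per-drive view of the parallel-I/O numbers at the top of this mail:
  echo $((36000 / 14))          # ~2571 client-visible write IOPS per drive
  echo $((36000 * 2 * 5 / 14))  # ~25714 per drive after 2x replication and an
                                # assumed ~5x write amplification -- still far
                                # below the ~300000 IOPS a single NVMe can do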
>>
>> Thanks,
>>
>> Wido
>>
>> [0]: https://docs.ceph.com/en/latest/dev/crimson/crimson
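
For anyone reproducing the qd=1 test: a minimal sketch of the tuning steps
Wido lists above, plus a fuller fio invocation. The original fio line is
truncated, so every flag beyond the four shown in the thread is an assumption,
as are the pool/image names and the exact C-state mechanism; adjust these to
your own cluster.

  # CPU frequency governor to performance (needs the cpupower tool)
  cpupower frequency-set -g performance

  # Keep cores out of deep C-states: either disable deep idle states at
  # runtime, or pin them at boot with a kernel parameter such as
  # processor.max_cstate=1
  cpupower idle-set -D 0

  # Turn off the debug logging mentioned above
  ceph config set osd debug_osd 0/0
  ceph config set osd debug_ms 0/0
  ceph config set osd debug_bluestore 0/0

  # 4k random writes at queue depth 1 against an RBD image
  # (--pool/--rbdname and everything after --direct=1 are assumed values)
  fio --ioengine=librbd --pool=rbd --rbdname=fio-test \
      --bs=4k --iodepth=1 --direct=1 --rw=randwrite \
      --runtime=60 --time_based --name=qd1-4k-write

Note that fio's librbd engine talks to the cluster directly, so the latency it
reports is the RADOS/OSD path without a guest VM in between.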