Hi Wido,

No. My results with Ceph (yeah, I still use it) are the same, and I use
Threadrippers, which run at almost 4 GHz. The network isn't the main
problem. The main problem is a lot of program logic written in a complex
way, which leads to high CPU usage. See
https://yourcmc.ru/wiki/Ceph_performance if you haven't already.

I achieve ~7000 QD=1 iops with Vitastor just because it's much simpler.
And I'm gradually progressing feature-wise... :-)

Regards,
Vitaliy

> (Sending it to the dev list as people there might know)
>
> Hi,
>
> There are many talks and presentations out there about Ceph's
> performance. Ceph is great when it comes to parallel I/O, large queue
> depths and many applications sending I/O towards Ceph.
>
> One thing where Ceph isn't the fastest is 4k blocks written at Queue
> Depth 1.
>
> Some applications benefit very much from high-performance/low-latency
> I/O at qd=1, for example single-threaded applications writing small
> files inside a VM running on RBD.
>
> With some tuning you can get to ~700us latency for a 4k write with
> qd=1 (replication, size=3).
>
> I benchmark this using fio:
>
> $ fio --ioengine=librbd --bs=4k --iodepth=1 --direct=1 .. .. .. ..
>
> 700us latency means the result will be about 1400 IOps (1000 / 0.7).
>
> Comparing this to, let's say, a BSD machine running ZFS, that's on the
> low side. With ZFS+NVMe you'll be able to reach somewhere between
> 7,000 and 10,000 IOps; the latency is simply much lower.
>
> My benchmarking / test setup for this:
>
> - Ceph Nautilus/Octopus (doesn't make a big difference)
> - 3x SuperMicro 1U with:
>   - AMD Epyc 7302P 16-core CPU
>   - 128GB DDR4
>   - 10x Samsung PM983 3.84TB
>   - 10Gbit Base-T networking
>
> Things to configure/tune:
>
> - C-state pinning to 1
> - CPU governor set to performance
> - Turn off all logging in Ceph (debug_osd, debug_ms, debug_bluestore=0)
>
> Higher clock speeds (new AMD Epyc coming in March!) help to reduce the
> latency, and going to 25Gbit/100Gbit might help as well.
>
> These are, however, only very small increments and might reduce the
> latency by another 15% or so.
>
> It doesn't bring us anywhere near the 10k IOps other applications can do.
>
> And I totally understand that replication over a TCP/IP network takes
> time and thus increases latency.
>
> The Crimson project [0] aims to lower the latency with things like DPDK
> and SPDK, but it is far from finished and production-ready.
>
> In the meantime, am I overlooking something here? Can we reduce the
> latency of the current OSDs further?
>
> Reaching ~500us latency would already be great!
>
> Thanks,
>
> Wido
>
> [0]: https://docs.ceph.com/en/latest/dev/crimson/crimson
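
P.S. For anyone trying to reproduce the 4k qd=1 write test quoted above,
a complete fio invocation could look roughly like the one below. The
pool, image and client names, the workload type and the runtime are
placeholders (they were elided in the original command), and depending
on the fio build the engine may be registered as "rbd" rather than
"librbd":

# pool, image and client names below are placeholders
$ fio --ioengine=rbd --clientname=admin --pool=rbd --rbdname=testimg \
      --rw=randwrite --bs=4k --iodepth=1 --direct=1 \
      --numjobs=1 --time_based --runtime=60 --name=4k-qd1-write

The average completion latency ("clat") fio reports for this job is the
number being discussed; at ~700us it works out to roughly 1400 IOps.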
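
The tuning steps from the quoted message translate to something like the
following on a typical Linux host; exact commands depend on the distro,
kernel and Ceph release, so treat this as a sketch rather than a recipe:

# Pin C-states: either boot with processor.max_cstate=1 on the kernel
# command line, or disable deeper idle states at runtime, e.g.:
$ cpupower idle-set -D 2        # disables idle states with >2us exit latency

# Set the performance governor on all cores
$ cpupower frequency-set -g performance

# Silence the most expensive Ceph debug subsystems (Nautilus and later
# can do this centrally via the config database)
$ ceph config set osd debug_osd 0/0
$ ceph config set osd debug_ms 0/0
$ ceph config set osd debug_bluestore 0/0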