Hi Wido,

No. My results with Ceph (yeah, I still use it) are the same, and I use
Threadrippers, which run at almost 4 GHz. The network isn't the main
problem. The main problem is a lot of program logic written in a complex
way, which leads to high CPU usage. See
https://yourcmc.ru/wiki/Ceph_performance if you haven't already.

I achieve ~7000 QD=1 iops with Vitastor just because it's much simpler.
And I'm gradually progressing feature-wise... :-)

Regards,
Vitaliy

> (Sending it to the dev list as people there might know)
>
> Hi,
>
> There are many talks and presentations out there about Ceph's
> performance. Ceph is great when it comes to parallel I/O, large queue
> depths and many applications sending I/O towards Ceph.
>
> One thing where Ceph isn't the fastest is 4k blocks written at Queue
> Depth 1.
>
> Some applications benefit very much from high-performance/low-latency
> I/O at qd=1, for example single-threaded applications writing small
> files inside a VM running on RBD.
>
> With some tuning you can get to ~700us latency for a 4k write with
> qd=1 (replication, size=3).
>
> I benchmark this using fio:
>
> $ fio --ioengine=librbd --bs=4k --iodepth=1 --direct=1 .. .. .. ..
>
> 700us latency means the result will be about 1400 IOps (1000 / 0.7).
>
> Comparing this to, let's say, a BSD machine running ZFS, that's on the
> low side. With ZFS+NVMe you'll be able to reach somewhere between
> 7,000 and 10,000 IOps; the latency is simply much lower.
>
> My benchmarking / test setup for this:
>
> - Ceph Nautilus/Octopus (doesn't make a big difference)
> - 3x SuperMicro 1U with:
>   - AMD Epyc 7302P 16-core CPU
>   - 128GB DDR4
>   - 10x Samsung PM983 3.84TB
>   - 10Gbit Base-T networking
>
> Things to configure/tune:
>
> - C-state pinning to 1
> - CPU governor set to performance
> - Turn off all logging in Ceph (debug_osd, debug_ms, debug_bluestore=0)
>
> Higher clock speeds (new AMD Epyc coming in March!) help to reduce the
> latency, and going to 25Gbit/100Gbit might help as well.
>
> These are, however, only very small increments and might reduce the
> latency by another 15% or so.
>
> It doesn't bring us anywhere near the 10k IOps other applications can do.
>
> And I totally understand that replication over a TCP/IP network takes
> time and thus increases latency.
>
> The Crimson project [0] aims to lower the latency with things like DPDK
> and SPDK, but it is far from finished and production-ready.
>
> In the meantime, am I overlooking something here? Can we reduce the
> latency of the current OSDs further?
>
> Reaching ~500us latency would already be great!
>
> Thanks,
>
> Wido
>
> [0]: https://docs.ceph.com/en/latest/dev/crimson/crimson
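
P.S. For anyone trying to reproduce the 4k qd=1 write test quoted above,
a complete fio invocation could look roughly like the one below. The
pool, image and client names, the workload type and the runtime are
placeholders (they were elided in the original command), and depending
on the fio build the engine may be registered as "rbd" rather than
"librbd":

# pool, image and client names below are placeholders
$ fio --ioengine=rbd --clientname=admin --pool=rbd --rbdname=testimg \
      --rw=randwrite --bs=4k --iodepth=1 --direct=1 \
      --numjobs=1 --time_based --runtime=60 --name=4k-qd1-write

The average completion latency ("clat") fio reports for this job is the
number being discussed; at ~700us it works out to roughly 1400 IOps.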
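
The tuning steps from the quoted message translate to something like the
following on a typical Linux host; exact commands depend on the distro,
kernel and Ceph release, so treat this as a sketch rather than a recipe:

# Pin C-states: either boot with processor.max_cstate=1 on the kernel
# command line, or disable deeper idle states at runtime, e.g.:
$ cpupower idle-set -D 2        # disables idle states with >2us exit latency

# Set the performance governor on all cores
$ cpupower frequency-set -g performance

# Silence the most expensive Ceph debug subsystems (Nautilus and later
# can do this centrally via the config database)
$ ceph config set osd debug_osd 0/0
$ ceph config set osd debug_ms 0/0
$ ceph config set osd debug_bluestore 0/0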