Parallel I/O, by the way, is also awful. I can only reach 36000 write IOPS in a
14-NVMe cluster with size=2 and 1 OSD per NVMe. That is only ~2500 IOPS per
drive... OK, even if I take the 2x replication * 5x write amplification into
account, it's still only ~25000 IOPS per drive. And these drives can push over
300000 IOPS :-(. Thankfully performance isn't too critical in this environment.
But it's very low :).

Regards,
Vitaliy

> Hi Wido
>
> No. My results with Ceph (yeah, I still use it) are the same, and I use
> Threadrippers which have almost 4 GHz clock speed.
>
> The network isn't the main problem. The main problem is a lot of program
> logic written in a complex way, which leads to high CPU usage. See
> https://yourcmc.ru/wiki/Ceph_performance if you haven't already.
>
> I achieve ~7000 QD=1 IOPS with Vitastor just because it's much simpler.
> And I'm gradually progressing feature-wise... :-)
>
> Regards,
> Vitaliy
>
>> (Sending this to the dev list as people there might know about it)
>>
>> Hi,
>>
>> There are many talks and presentations out there about Ceph's
>> performance. Ceph is great when it comes to parallel I/O, large queue
>> depths and many applications sending I/O towards Ceph.
>>
>> One thing where Ceph isn't the fastest is 4k blocks written at Queue
>> Depth 1.
>>
>> Some applications benefit very much from high-performance/low-latency
>> I/O at qd=1, for example single-threaded applications which are writing
>> small files inside a VM running on RBD.
>>
>> With some tuning you can get to ~700us latency for a 4k write with
>> qd=1 (replication, size=3).
>>
>> I benchmark this using fio:
>>
>> $ fio --ioengine=librbd --bs=4k --iodepth=1 --direct=1 .. .. .. ..
>>
>> 700us latency means the result will be about 1500 IOPS (1000 / 0.7)
>>
>> When comparing this to, let's say, a BSD machine running ZFS, that's on
>> the low side. With ZFS+NVMe you'll be able to reach somewhere between
>> 7,000 and 10,000 IOPS; the latency is simply much lower.
>>
>> My benchmarking / test setup for this:
>>
>> - Ceph Nautilus/Octopus (doesn't make a big difference)
>> - 3x SuperMicro 1U with:
>>   - AMD Epyc 7302P 16-core CPU
>>   - 128GB DDR4
>>   - 10x Samsung PM983 3.84TB
>>   - 10Gbit Base-T networking
>>
>> Things to configure/tune:
>>
>> - Pin C-States to 1
>> - Set the CPU governor to performance
>> - Turn off all logging in Ceph (debug_osd, debug_ms, debug_bluestore = 0)
>>
>> Higher clock speeds (new AMD Epyc coming in March!) help to reduce the
>> latency, and going towards 25Gbit/100Gbit might help as well.
>>
>> These are however only very small increments and might reduce the
>> latency by another 15% or so.
>>
>> It doesn't bring us anywhere near the 10k IOPS other applications can do.
>>
>> And I totally understand that replication over a TCP/IP network takes
>> time and thus increases latency.
>>
>> The Crimson project [0] is aiming to lower the latency with many things
>> like DPDK and SPDK, but this is far from finished and production ready.
>>
>> In the meantime, am I overlooking something here? Can we reduce the
>> latency of the current OSDs further?
>>
>> Reaching a ~500us latency would already be great!
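
A quick worked version of the arithmetic in this thread (nothing new here,
just the quoted numbers written out as shell arithmetic; the ~5x write
amplification is the estimate from the top of this mail, not a measured value):

  # At QD=1, IOPS is roughly the inverse of the per-write latency:
  echo $((1000000 / 700))       # ~1428 IOPS at 700us, i.e. the ~1500 above
  echo $((1000000 / 500))       # 2000 IOPS at the hoped-for 500us

  # Per-drive view of the parallel-I/O numbers at the top of this mail:
  echo $((36000 / 14))          # ~2571 client-visible write IOPS per drive
  echo $((36000 * 2 * 5 / 14))  # ~25714 per drive after 2x replication and an
                                # assumed ~5x write amplification -- still far
                                # below the ~300000 IOPS a single NVMe can do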
>>
>> Thanks,
>>
>> Wido
>>
>> [0]: https://docs.ceph.com/en/latest/dev/crimson/crimson
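
For anyone reproducing the qd=1 test: a minimal sketch of the tuning steps
Wido lists above, plus a fuller fio invocation. The original fio line is
truncated, so every flag beyond the four shown in the thread is an assumption,
as are the pool/image names and the exact C-state mechanism; adjust these to
your own cluster.

  # CPU frequency governor to performance (needs the cpupower tool)
  cpupower frequency-set -g performance

  # Keep cores out of deep C-states: either disable deep idle states at
  # runtime, or pin them at boot with a kernel parameter such as
  # processor.max_cstate=1
  cpupower idle-set -D 0

  # Turn off the debug logging mentioned above
  ceph config set osd debug_osd 0/0
  ceph config set osd debug_ms 0/0
  ceph config set osd debug_bluestore 0/0

  # 4k random writes at queue depth 1 against an RBD image
  # (--pool/--rbdname and everything after --direct=1 are assumed values)
  fio --ioengine=librbd --pool=rbd --rbdname=fio-test \
      --bs=4k --iodepth=1 --direct=1 --rw=randwrite \
      --runtime=60 --time_based --name=qd1-4k-write

Note that fio's librbd engine talks to the cluster directly, so the latency it
reports is the RADOS/OSD path without a guest VM in between.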