Re: Increasing QD=1 performance (lowering latency)

On 05/02/2021 17:24, vitalif@xxxxxxxxxx wrote:
> Hi Wido
>
> No. My results with Ceph (yeah, I still use it) are the same, and I use Threadrippers, which have almost 4 GHz clock speed.


Understood.

> Network isn't the main problem. The main problem is a lot of program logic written in a complex way, which leads to high CPU usage. See https://yourcmc.ru/wiki/Ceph_performance if you haven't already.


I'm aware of that Wiki. Indeed, the network isn't the main issue. Therefore most of the systems I design still run on 10G, as anything beyond that doesn't really benefit the cluster in terms of latency.

> I achieve ~7000 QD=1 IOPS with Vitastor just because it's much simpler. And I'm gradually progressing feature-wise... :-)

I'm not looking for something other than Ceph :-)

Wido


> Regards, Vitaliy

(Sending it to the dev list as people there might know.)

Hi,

There are many talks and presentations out there about Ceph's
performance. Ceph is great when it comes to parallel I/O, large queue
depths and many applications sending I/O towards Ceph.

One thing where Ceph isn't the fastest is 4k blocks written at Queue
Depth 1.

Some applications benefit greatly from high-performance/low-latency
I/O at qd=1, for example single-threaded applications writing
small files inside a VM running on RBD.

With some tuning you can get to ~700us latency for a 4k write with
qd=1 (replication, size=3).

I benchmark this using fio:

$ fio --ioengine=librbd --bs=4k --iodepth=1 --direct=1 .. .. .. ..

A 700us latency means the result will be about 1500 IOPS (1000 / 0.7).
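
For reference, a fuller version of that invocation might look roughly like
the sketch below. The pool name "rbd", image name "fio-test", client name
and runtime are placeholders rather than the values from my actual run, and
depending on the fio build the RBD engine may need to be selected as "rbd"
instead of "librbd":

$ fio --name=qd1-4k-write \
      --ioengine=rbd --clientname=admin --pool=rbd --rbdname=fio-test \
      --rw=randwrite --bs=4k --iodepth=1 --direct=1 \
      --numjobs=1 --time_based --runtime=60 --group_reporting

The number to watch in the output is the completion latency (clat), which
is where the ~700us shows up.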

When comparing this to, let's say, a BSD machine running ZFS, that's on the
low side. With ZFS+NVMe you'll be able to reach somewhere between
7,000 and 10,000 IOPS (roughly 100-140us per write); the latency is simply
much lower.

My benchmarking / test setup for this:

- Ceph Nautilus/Octopus (doesn't make a big difference)
- 3x SuperMicro 1U, each with:
  - AMD Epyc 7302P 16-core CPU
  - 128GB DDR4
  - 10x Samsung PM983 3.84TB
  - 10Gbit Base-T networking

Things to configure/tune (a sketch of the corresponding commands follows
below this list):

- C-state pinning to C1
- CPU governor set to performance
- Turn off all logging in Ceph (debug_osd, debug_ms, debug_bluestore = 0)
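
For reference, this roughly translates into something like the following.
It's only a sketch; the exact kernel parameters and debug options can differ
per distribution and Ceph release:

# Limit C-states via the kernel command line (add to GRUB_CMDLINE_LINUX and
# regenerate the grub config; intel_idle.max_cstate=1 is the usual
# equivalent on Intel boxes)
processor.max_cstate=1

# Set the performance governor on all cores
$ cpupower frequency-set -g performance

# Silence the most expensive debug logging on the OSDs
$ ceph config set osd debug_osd 0/0
$ ceph config set osd debug_ms 0/0
$ ceph config set osd debug_bluestore 0/0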

Higher clock speeds (new AMD Epyc CPUs coming in March!) help to reduce the
latency, and going towards 25Gbit/100Gbit networking might help as well.

These are, however, only very small increments and might reduce the
latency by another 15% or so.

It doesn't bring us anywhere near the 10k IOPS other applications can do.

And I totally understand that replication over a TCP/IP network takes
time and thus increases latency.

The Crimson project [0] aims to lower the latency with technologies
like DPDK and SPDK, but it is far from finished and production-ready.

In the meantime, am I overlooking something here? Can we reduce the
latency of the current OSDs further?

Reaching a ~500us latency would already be great!

Thanks,

Wido

[0]: https://docs.ceph.com/en/latest/dev/crimson/crimson
_______________________________________________
Dev mailing list -- dev@xxxxxxx
To unsubscribe send an email to dev-leave@xxxxxxx