Re: Increasing QD=1 performance (lowering latency)

Hi Wido,

Do you know what happened to Mellanox's Ceph RDMA project from 2018?

We will be testing ARM Ampere for all-flash in the coming half-year and will probably get the opportunity to experiment with software-defined memory.

Regards, Joachim

___________________________________

Clyso GmbH

On 08.02.2021 at 14:21, Paul Emmerich wrote:
A few things that you can try on the network side to shave off microseconds:

1) 10G Base-T adds quite a bit of latency compared to fiber or DAC. I've
measured 2 µs on Base-T vs. 0.3 µs on fiber for one link in one direction,
so that's 8 µs you can save on a round trip if the path is client -> switch -> OSD
and back. Note that my measurement was for small packets; I'm not sure how big
that penalty still is with large packets. Some of it comes from the large
block size (~3 kbit IIRC) of the layer 1 encoding, some is just processing
time of that complex encoding.

2) Setting the switch to cut-through instead of store-and-forward can help,
especially on slower links. Serialization time is 0.8 ns per byte on 10
Gbit/s, so ~3.2 µs for a 4 KB packet.

3) Depending on which NIC you use: check if it has some kind of interrupt
throttling feature that you can adjust or disable. If your Base-T NIC is an
Intel NIC, especially one of the older Niantic ones (i.e. X5xx, using ixgbe;
probably also X7xx, i40e), that can make a large difference. Try setting
itr=0 for the ixgbe kernel module. Note that you might want to compile your
kernel with CONFIG_IRQ_TIME_ACCOUNTING when using this option, otherwise
CPU usage statistics will be wildly inaccurate if the driver takes a
significant amount of CPU time (should not be a problem for the setup
described here, but something to be aware of). This may get you up to 100 µs
in the best case. No idea about other NICs.
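
If you are on the in-tree ixgbe driver, the same knob is typically reachable
through ethtool's interrupt-coalescing settings rather than a module
parameter; a rough sketch (the interface name eth0 is a placeholder, and
exact option support depends on driver and firmware):

# show the current interrupt coalescing settings
$ ethtool -c eth0
# disable adaptive moderation and drop the delays to zero
# (lowest latency, at the cost of higher interrupt/CPU load)
$ ethtool -C eth0 adaptive-rx off adaptive-tx off rx-usecs 0 tx-usecs 0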

4) No idea about the state of support in Ceph, but: SO_BUSY_POLL on sockets
does help with latency; I forget the details, though.
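
Even without application support you can approximate this system-wide via
the kernel's busy-polling sysctls; the values are the microseconds the kernel
will busy-wait on a socket, and 50 is just a commonly used starting point I
have not validated with Ceph:

# enable socket busy polling globally (runtime setting, not persistent)
$ sysctl -w net.core.busy_read=50
$ sysctl -w net.core.busy_poll=50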

5) Correct NUMA pinning (a single-socket AMD system is NUMA) can reduce
tail latency, but it doesn't do much for average or median latency. I have
no Ceph-specific insights on this one, though.
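
As a rough illustration of what I mean by pinning (interface and unit names
are placeholders; the right CPU range depends on your topology):

# find the NUMA node the NIC sits on
$ cat /sys/class/net/eth0/device/numa_node
# pin an OSD to that node's CPUs via a systemd drop-in for its unit
$ systemctl edit ceph-osd@0
# then add, for example:
#   [Service]
#   CPUAffinity=0-15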


This could get you a few microseconds; I think especially 3 and 4 are worth
trying. Please do report results if you test this, I'm always interested in
hearing stories about low-level performance optimizations :)

Paul



On Tue, Feb 2, 2021 at 10:17 AM Wido den Hollander <wido@xxxxxxxx> wrote:

Hi,

There are many talks and presentations out there about Ceph's
performance. Ceph is great when it comes to parallel I/O, large queue
depths and many applications sending I/O towards Ceph.

One thing where Ceph isn't the fastest is 4k blocks written at Queue
Depth 1.

Some applications benefit very much from high-performance/low-latency
I/O at qd=1, for example single-threaded applications writing
small files inside a VM running on RBD.

With some tuning you can get to ~700us latency for a 4k write with
qd=1 (replication, size=3).

I benchmark this using fio:

$ fio --ioengine=librbd --bs=4k --iodepth=1 --direct=1 .. .. .. ..

700us latency means the result will be about ~1400 IOps (1000 / 0.7 ≈ 1430)
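
For completeness, the full command line looks roughly like this (pool, image
and client names are placeholders, and depending on your fio version the
engine may be called rbd instead of librbd):

$ fio --ioengine=librbd --pool=rbd --rbdname=bench --clientname=admin \
      --rw=randwrite --bs=4k --iodepth=1 --direct=1 --numjobs=1 \
      --runtime=60 --time_based --name=qd1-4k-write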

When comparing this to, let's say, a BSD machine running ZFS, that's on the
low side. With ZFS+NVMe you'll be able to reach somewhere between
7,000 and 10,000 IOps; the latency is simply much lower.

My benchmarking / test setup for this:

- Ceph Nautilus/Octopus (doesn't make a big difference)
- 3x SuperMicro 1U with:
- AMD Epyc 7302P 16-core CPU
- 128GB DDR4
- 10x Samsung PM983 3.84TB
- 10Gbit Base-T networking

Things to configure/tune (a rough sketch of the commands follows below):

- C-state pinning to 1
- CPU governor set to performance
- Turn off all logging in Ceph (debug_osd, debug_ms, debug_bluestore=0)
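
Concretely, something along these lines (exact commands depend on distro,
CPU and idle driver; treat this as a sketch rather than verified settings):

# force the performance governor on all cores
$ cpupower frequency-set -g performance
# limiting C-states is usually a kernel command line parameter,
# e.g. processor.max_cstate=1 (or intel_idle.max_cstate=1 on Intel)
# silence the most chatty debug subsystems at runtime
$ ceph config set osd debug_osd 0/0
$ ceph config set osd debug_ms 0/0
$ ceph config set osd debug_bluestore 0/0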

Higher clock speeds (new AMD Epyc CPUs coming in March!) help to reduce the
latency, and going towards 25Gbit/100Gbit might help as well.

These are, however, only very small increments and might reduce
the latency by another 15% or so.

It doesn't bring us anywhere near the 10k IOps other applications can do.

And I totally understand that replication over a TCP/IP network takes
time and thus increases latency.

The Crimson project [0] aims to lower the latency with technologies
like DPDK and SPDK, but it is far from finished and production-ready.

In the meantime, am I overlooking something here? Can we further reduce the
latency of the current OSDs?

Reaching a ~500us latency would already be great!

Thanks,

Wido


[0]: https://docs.ceph.com/en/latest/dev/crimson/crimson/
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx



