A few things that you can try on the network side to shave off microseconds:

1) 10GBASE-T has quite a bit of latency compared to fiber or DAC. I've measured 2 µs on Base-T vs. 0.3 µs on fiber for one link in one direction, so that's roughly 7 µs you can save on a round-trip if the path is client -> switch -> OSD and back (four link traversals, ~1.7 µs saved on each). Note that my measurement was for small packets; I'm not sure how big that penalty still is with large packets. Some of it comes from the large block size (~3 kbit IIRC) of the layer 1 encoding, some is just processing time for that complex encoding.

2) Setting the switch to cut-through instead of store-and-forward can help, especially on slower links. Serialization time is 0.8 ns per byte at 10 Gbit/s, so ~3.2 µs for a 4 kB packet.

3) Depending on which NIC you use: check whether it has some kind of interrupt throttling feature that you can adjust or disable. If your Base-T NIC is an Intel NIC, especially one of the older Niantic ones (i.e. X5xx with ixgbe, probably also X7xx with i40e), that can make a large difference. Try setting itr=0 for the ixgbe kernel module. Note that you might want to compile your kernel with CONFIG_IRQ_TIME_ACCOUNTING when using this option, otherwise CPU usage statistics will be wildly inaccurate if the driver takes a significant amount of CPU time (that should not be a problem for the setup described here, but something to be aware of). This may get you up to 100 µs in the best case. No idea about other NICs.

4) No idea about the state in Ceph, but: SO_BUSY_POLL on sockets does help with latency, though I forgot the details.

5) Correct NUMA pinning (even a single-socket AMD system is NUMA) can reduce tail latency, but it doesn't do anything for average and median latency, and I have no insights specific to Ceph here. This could get you a few microseconds.

I think especially 3 and 4 are worth trying; some rough example commands for 2) through 5) are sketched below.
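For 2), the arithmetic behind the ~3.2 µs figure, using 4096 bytes as an example payload (the cut-through setting itself is vendor-specific switch configuration, so no command for that):

# 10 Gbit/s means 0.8 ns of serialization delay per byte; a store-and-forward
# switch has to receive the whole frame before forwarding it, so it adds
# roughly one full serialization time per hop compared to cut-through.
$ awk 'BEGIN { printf "%.2f us\n", 4096 * 8 / 10e9 * 1e6 }'
3.28 us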
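For 3), a rough way to inspect and disable interrupt moderation via ethtool rather than the module parameter; eth0 is just a placeholder and not every driver accepts all of these parameters:

# show the current interrupt coalescing settings
$ ethtool -c eth0

# turn off adaptive moderation and make interrupts fire immediately
$ ethtool -C eth0 adaptive-rx off adaptive-tx off rx-usecs 0 tx-usecs 0

The price is a much higher interrupt rate and more driver CPU time, which is exactly why CONFIG_IRQ_TIME_ACCOUNTING matters if you go down this road.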
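For 4), if patching Ceph to set SO_BUSY_POLL per socket isn't an option, the kernel can busy-poll system-wide; the values are microseconds to poll before sleeping, and 50 is only a common starting point, not a recommendation:

# busy-poll on blocking socket reads and on poll()/select()
$ sysctl -w net.core.busy_read=50
$ sysctl -w net.core.busy_poll=50

Like itr=0 this trades CPU cycles for latency, so keep an eye on CPU usage.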
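For 5), a minimal manual pinning sketch with numactl; the interface name, node number and OSD id are just examples, and in a real deployment you'd rather put the affinity into the OSD's systemd unit (CPUAffinity=, or NUMAPolicy= on newer systemd):

# which NUMA node is the NIC attached to?
$ cat /sys/class/net/eth0/device/numa_node

# run an OSD with CPU and memory bound to that node (node 0 here)
$ numactl --cpunodebind=0 --membind=0 ceph-osd -f --cluster ceph --id 0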
Please do report results if you test this, I'm always interested in hearing stories about low-level performance optimizations :)

Paul

On Tue, Feb 2, 2021 at 10:17 AM Wido den Hollander <wido@xxxxxxxx> wrote:
> Hi,
>
> There are many talks and presentations out there about Ceph's performance.
> Ceph is great when it comes to parallel I/O, large queue depths and many
> applications sending I/O towards Ceph.
>
> One thing where Ceph isn't the fastest is 4k blocks written at Queue Depth 1.
>
> Some applications benefit very much from high performance/low latency I/O
> at qd=1, for example single-threaded applications which are writing small
> files inside a VM running on RBD.
>
> With some tuning you can get to a ~700us latency for a 4k write with qd=1
> (Replication, size=3)
>
> I benchmark this using fio:
>
> $ fio --ioengine=librbd --bs=4k --iodepth=1 --direct=1 .. .. .. ..
>
> 700us latency means the result will be ~1500 IOps (1000 / 0.7)
>
> When comparing this to, let's say, a BSD machine running ZFS, that's on
> the low side. With ZFS+NVMe you'll be able to reach somewhere between
> 7,000 and 10,000 IOps, the latency is simply much lower.
>
> My benchmarking / test setup for this:
>
> - Ceph Nautilus/Octopus (doesn't make a big difference)
> - 3x SuperMicro 1U with:
>   - AMD Epyc 7302P 16-core CPU
>   - 128GB DDR4
>   - 10x Samsung PM983 3.84TB
>   - 10Gbit Base-T networking
>
> Things to configure/tune:
>
> - C-State pinning to 1
> - CPU governor to performance
> - Turn off all logging in Ceph (debug_osd, debug_ms, debug_bluestore=0)
>
> Higher clock speeds (new AMD Epyc coming in March!) help to reduce the
> latency and going towards 25Gbit/100Gbit might help as well.
>
> These are however only very small increments and might help to reduce the
> latency by another 15% or so.
>
> It doesn't bring us anywhere near the 10k IOps other applications can do.
>
> And I totally understand that replication over a TCP/IP network takes time
> and thus increases latency.
>
> The Crimson project [0] is aiming to lower the latency with many things
> like DPDK and SPDK, but this is far from finished and production ready.
>
> In the meantime, am I overlooking some things here? Can we reduce the
> latency of the current OSDs further?
>
> Reaching a ~500us latency would already be great!
>
> Thanks,
>
> Wido
>
> [0]: https://docs.ceph.com/en/latest/dev/crimson/crimson/
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx