Good day!

Thu, Feb 11, 2021 at 04:00:31PM +0100, joachim.kraftmayer wrote:
> Hi Wido,
>
> do you know what happened to Mellanox's Ceph RDMA project of 2018?

We tested ceph/rdma on Mellanox ConnectX-4 Lx for one year and saw no
visible benefits. What we did see were strange connection outages between
OSDs leading to slow ops and service outages, so we returned to good old
posix.

> We will test ARM Ampere for all-flash this half-year and will probably
> get the opportunity to experiment with software-defined memory.
>
> Regards, Joachim
>
> ___________________________________
>
> Clyso GmbH
>
> On 08.02.2021 at 14:21, Paul Emmerich wrote:
> > A few things that you can try on the network side to shave off
> > microseconds:
> >
> > 1) 10G Base-T has quite some latency compared to fiber or DAC. I've
> > measured 2 µs on Base-T vs. 0.3 µs on fiber for one link in one
> > direction, so that's ~8 µs you can save on a round-trip if the path is
> > client -> switch -> osd and back. Note that my measurement was for
> > small packets; I'm not sure how big that penalty still is with large
> > packets. Some of it comes from the large block size (~3 kbit IIRC) of
> > the layer 1 encoding, some is just processing time of that complex
> > encoding.
> >
> > 2) Setting the switch to cut-through instead of store-and-forward can
> > help, especially on slower links. Serialization time is 0.8 ns per
> > byte on 10 Gbit, so ~3.2 µs for a 4 KB packet.
> >
> > 3) Depending on which NIC you use: check if it has some kind of
> > interrupt throttling feature that you can adjust or disable. If your
> > Base-T NIC is an Intel NIC, especially one of the older Niantic ones
> > (i.e. X5xx using ixgbe; probably also X7xx using i40e), that can make
> > a large difference. Try setting itr=0 for the ixgbe kernel module.
> > Note that you might want to compile your kernel with
> > CONFIG_IRQ_TIME_ACCOUNTING when using this option, otherwise CPU usage
> > statistics will be wildly inaccurate if the driver takes a significant
> > amount of CPU time (should not be a problem for the setup described
> > here, but something to be aware of). This may get you up to 100 µs in
> > the best case. No idea about other NICs.
> >
> > 4) No idea about the state in Ceph, but: SO_BUSY_POLL on sockets does
> > help with latency, but I forgot the details.
> >
> > 5) Correct NUMA pinning (a single-socket AMD system is NUMA) can
> > reduce tail latency, but doesn't do anything for average and median
> > latency; I have no insights specific to Ceph, though.
> >
> > This could get you a few microseconds; I think especially 3 and 4 are
> > worth trying. Please do report results if you test this, I'm always
> > interested in hearing stories about low-level performance
> > optimizations :)
> >
> > Paul
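To make 3) and 4) a bit more concrete, a minimal sketch of the Linux-side
knobs (eth0 and all values below are placeholders, not tested
recommendations; the InterruptThrottleRate option belongs to Intel's
out-of-tree ixgbe driver, while the in-tree driver exposes the same knob
through ethtool):

  # 3) Interrupt throttling off with the out-of-tree ixgbe driver
  #    (one value per port):
  modprobe ixgbe InterruptThrottleRate=0,0

  # In-tree driver: disable rx interrupt coalescing instead:
  ethtool -C eth0 rx-usecs 0

  # 4) Approximate SO_BUSY_POLL system-wide; values are in microseconds
  #    and trade CPU time for latency:
  sysctl -w net.core.busy_read=50
  sysctl -w net.core.busy_poll=50

Note that busy_read only affects blocking socket reads and busy_poll
affects poll/select; an application setting SO_BUSY_POLL on its own
sockets can be more targeted than the global sysctls.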
> > On Tue, Feb 2, 2021 at 10:17 AM Wido den Hollander <wido@xxxxxxxx> wrote:
> > > Hi,
> > >
> > > There are many talks and presentations out there about Ceph's
> > > performance. Ceph is great when it comes to parallel I/O, large
> > > queue depths and many applications sending I/O towards Ceph.
> > >
> > > One thing where Ceph isn't the fastest is 4k blocks written at
> > > queue depth 1.
> > >
> > > Some applications benefit very much from high-performance/low-latency
> > > I/O at qd=1, for example single-threaded applications which are
> > > writing small files inside a VM running on RBD.
> > >
> > > With some tuning you can get to ~700us latency for a 4k write with
> > > qd=1 (replication, size=3).
> > >
> > > I benchmark this using fio:
> > >
> > > $ fio --ioengine=librbd --bs=4k --iodepth=1 --direct=1 .. .. .. ..
> > >
> > > 700us latency means the result will be roughly 1,400 IOps
> > > (1000 / 0.7).
> > >
> > > Compared to, let's say, a BSD machine running ZFS, that is on the
> > > low side. With ZFS+NVMe you'll be able to reach somewhere between
> > > 7,000 and 10,000 IOps; the latency is simply much lower.
> > >
> > > My benchmarking / test setup for this:
> > >
> > > - Ceph Nautilus/Octopus (doesn't make a big difference)
> > > - 3x SuperMicro 1U with:
> > >   - AMD Epyc 7302P 16-core CPU
> > >   - 128GB DDR4
> > >   - 10x Samsung PM983 3.84TB
> > >   - 10Gbit Base-T networking
> > >
> > > Things to configure/tune:
> > >
> > > - C-State pinning to 1
> > > - CPU governor set to performance
> > > - Turn off all logging in Ceph (debug_osd, debug_ms,
> > >   debug_bluestore=0)
> > >
> > > Higher clock speeds (new AMD Epyc coming in March!) help to reduce
> > > the latency, and going towards 25Gbit/100Gbit might help as well.
> > >
> > > These are however only very small increments and might reduce the
> > > latency by another 15% or so. It doesn't bring us anywhere near the
> > > 10k IOps other applications can do.
> > >
> > > And I totally understand that replication over a TCP/IP network
> > > takes time and thus increases latency.
> > >
> > > The Crimson project [0] is aiming to lower the latency with things
> > > like DPDK and SPDK, but it is far from finished and not yet
> > > production-ready.
> > >
> > > In the meantime, am I overlooking something here? Can we reduce the
> > > latency of the current OSDs further?
> > >
> > > Reaching ~500us latency would already be great!
> > >
> > > Thanks,
> > >
> > > Wido
> > >
> > > [0]: https://docs.ceph.com/en/latest/dev/crimson/crimson/
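And regarding Wido's tuning list: for reference, a sketch of what those
three knobs can look like on a Linux host (the 2 µs idle-state latency
threshold is my guess at "C-State pinning to 1"; check 'cpupower
idle-info' for your CPU's actual exit latencies):

  # CPU frequency governor to performance on all cores:
  cpupower frequency-set -g performance

  # Disable every idle state with an exit latency above ~2 µs,
  # which roughly pins the cores to C1:
  cpupower idle-set -D 2

  # Turn off the relevant Ceph debug logging at runtime:
  ceph config set osd debug_osd 0/0
  ceph config set osd debug_ms 0/0
  ceph config set osd debug_bluestore 0/0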