Good day!

Thu, Feb 11, 2021 at 04:00:31PM +0100, joachim.kraftmayer wrote:
> Hi Wido,
>
> do you know what happened to Mellanox's Ceph RDMA project of 2018?

We tested ceph/rdma on Mellanox ConnectX-4 Lx for one year and saw no
visible benefits. What we did see were strange connection outages between
OSDs leading to slow ops and service outages, so we returned to good old
posix.

> We will test ARM Ampere for all-flash this half-year and will probably
> get the opportunity to experiment with software-defined memory.
>
> Regards, Joachim
>
> ___________________________________
>
> Clyso GmbH
>
> On 08.02.2021 at 14:21, Paul Emmerich wrote:
> > A few things that you can try on the network side to shave off
> > microseconds:
> >
> > 1) 10G Base-T has quite some latency compared to fiber or DAC. I've
> > measured 2 µs on Base-T vs. 0.3 µs on fiber for one link in one
> > direction, so that's ~8 µs you can save on a round-trip if the path is
> > client -> switch -> osd and back. Note that my measurement was for
> > small packets; I'm not sure how big that penalty still is with large
> > packets. Some of it comes from the large block size (~3 kbit IIRC) of
> > the layer 1 encoding, some is just processing time of that complex
> > encoding.
> >
> > 2) Setting the switch to cut-through instead of store-and-forward can
> > help, especially on slower links. Serialization time is 0.8 ns per
> > byte on 10 Gbit, so ~3.2 µs for a 4 KB packet.
> >
> > 3) Depending on which NIC you use: check if it has some kind of
> > interrupt throttling feature that you can adjust or disable. If your
> > Base-T NIC is an Intel NIC, especially one of the older Niantic ones
> > (i.e. X5xx using ixgbe; probably also X7xx using i40e), that can make
> > a large difference. Try setting itr=0 for the ixgbe kernel module.
> > Note that you might want to compile your kernel with
> > CONFIG_IRQ_TIME_ACCOUNTING when using this option, otherwise CPU usage
> > statistics will be wildly inaccurate if the driver takes a significant
> > amount of CPU time (should not be a problem for the setup described
> > here, but something to be aware of). This may get you up to 100 µs in
> > the best case. No idea about other NICs.
> >
> > 4) No idea about the state in Ceph, but: SO_BUSY_POLL on sockets does
> > help with latency, but I forgot the details.
> >
> > 5) Correct NUMA pinning (a single-socket AMD system is NUMA) can
> > reduce tail latency, but doesn't do anything for average and median
> > latency; I have no insights specific to Ceph, though.
> >
> > This could get you a few microseconds; I think especially 3 and 4 are
> > worth trying. Please do report results if you test this, I'm always
> > interested in hearing stories about low-level performance
> > optimizations :)
> >
> > Paul
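To make 3) and 4) a bit more concrete, a minimal sketch of the Linux-side
knobs (eth0 and all values below are placeholders, not tested
recommendations; the InterruptThrottleRate option belongs to Intel's
out-of-tree ixgbe driver, while the in-tree driver exposes the same knob
through ethtool):

  # 3) Interrupt throttling off with the out-of-tree ixgbe driver
  #    (one value per port):
  modprobe ixgbe InterruptThrottleRate=0,0

  # In-tree driver: disable rx interrupt coalescing instead:
  ethtool -C eth0 rx-usecs 0

  # 4) Approximate SO_BUSY_POLL system-wide; values are in microseconds
  #    and trade CPU time for latency:
  sysctl -w net.core.busy_read=50
  sysctl -w net.core.busy_poll=50

Note that busy_read only affects blocking socket reads and busy_poll
affects poll/select; an application setting SO_BUSY_POLL on its own
sockets can be more targeted than the global sysctls.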
> > On Tue, Feb 2, 2021 at 10:17 AM Wido den Hollander <wido@xxxxxxxx> wrote:
> > > Hi,
> > >
> > > There are many talks and presentations out there about Ceph's
> > > performance. Ceph is great when it comes to parallel I/O, large
> > > queue depths and many applications sending I/O towards Ceph.
> > >
> > > One thing where Ceph isn't the fastest is 4k blocks written at
> > > queue depth 1.
> > >
> > > Some applications benefit very much from high-performance/low-latency
> > > I/O at qd=1, for example single-threaded applications which are
> > > writing small files inside a VM running on RBD.
> > >
> > > With some tuning you can get to ~700us latency for a 4k write with
> > > qd=1 (replication, size=3).
> > >
> > > I benchmark this using fio:
> > >
> > > $ fio --ioengine=librbd --bs=4k --iodepth=1 --direct=1 .. .. .. ..
> > >
> > > 700us latency means the result will be roughly 1,400 IOps
> > > (1000 / 0.7).
> > >
> > > Compared to, let's say, a BSD machine running ZFS, that is on the
> > > low side. With ZFS+NVMe you'll be able to reach somewhere between
> > > 7,000 and 10,000 IOps; the latency is simply much lower.
> > >
> > > My benchmarking / test setup for this:
> > >
> > > - Ceph Nautilus/Octopus (doesn't make a big difference)
> > > - 3x SuperMicro 1U with:
> > >   - AMD Epyc 7302P 16-core CPU
> > >   - 128GB DDR4
> > >   - 10x Samsung PM983 3.84TB
> > >   - 10Gbit Base-T networking
> > >
> > > Things to configure/tune:
> > >
> > > - C-State pinning to 1
> > > - CPU governor set to performance
> > > - Turn off all logging in Ceph (debug_osd, debug_ms,
> > >   debug_bluestore=0)
> > >
> > > Higher clock speeds (new AMD Epyc coming in March!) help to reduce
> > > the latency, and going towards 25Gbit/100Gbit might help as well.
> > >
> > > These are however only very small increments and might reduce the
> > > latency by another 15% or so. It doesn't bring us anywhere near the
> > > 10k IOps other applications can do.
> > >
> > > And I totally understand that replication over a TCP/IP network
> > > takes time and thus increases latency.
> > >
> > > The Crimson project [0] is aiming to lower the latency with things
> > > like DPDK and SPDK, but it is far from finished and not yet
> > > production-ready.
> > >
> > > In the meantime, am I overlooking something here? Can we reduce the
> > > latency of the current OSDs further?
> > >
> > > Reaching ~500us latency would already be great!
> > >
> > > Thanks,
> > >
> > > Wido
> > >
> > > [0]: https://docs.ceph.com/en/latest/dev/crimson/crimson/
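And regarding Wido's tuning list: for reference, a sketch of what those
three knobs can look like on a Linux host (the 2 µs idle-state latency
threshold is my guess at "C-State pinning to 1"; check 'cpupower
idle-info' for your CPU's actual exit latencies):

  # CPU frequency governor to performance on all cores:
  cpupower frequency-set -g performance

  # Disable every idle state with an exit latency above ~2 µs,
  # which roughly pins the cores to C1:
  cpupower idle-set -D 2

  # Turn off the relevant Ceph debug logging at runtime:
  ceph config set osd debug_osd 0/0
  ceph config set osd debug_ms 0/0
  ceph config set osd debug_bluestore 0/0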