Ceph RBD latencies

Hi Ceph-users,

TL;DR - I can't seem to pin down why an unloaded system with flash-based OSD journals has higher-than-desired write latencies for RBD devices.  Any ideas?


I am developing a storage system based on Ceph and an SCST+Pacemaker cluster.  Our initial testing showed promising results even with the mixed hardware we had available, so we ordered a purpose-designed platform to develop towards production.  The hardware is:

2x 1RU servers as "frontends" (SCST+Pacemaker - Ceph mons and RBD clients - they present iSCSI to other systems).
3x 2RU SSD OSD servers (24-bay 2.5") - currently with 4x 2TB Samsung EVO SSDs each
3x 4RU SATA OSD servers (36-bay) - currently with 6x 8TB Seagate drives each

As part of the research and planning we opted to put a pair of Intel DC P3700 400GB NVMe cards in each OSD server.  These are mirrored and set up as the journals for the OSD disks, the aim being to improve write latencies.  All the machines have 128GB RAM and dual E5-2630v3 CPUs, and use 4 aggregated 10G NICs back to a common pair of switches.  All machines run CentOS 7, with the frontends using the 4.4.1 kernel-ml kernel from ELRepo to get a newer RBD kernel module.
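
For concreteness, the journal hookup on each OSD server looks like the sketch below (device names and paths are illustrative, not our exact layout):

    # each filestore OSD's journal is a symlink to a ~12G partition on the
    # mirrored NVMe device, e.g.:
    #   /var/lib/ceph/osd/ceph-0/journal -> /dev/md0p1
    # with the matching size hint in ceph.conf:
    [osd]
    osd journal size = 12288      # MB, i.e. 12G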

On the Ceph side each disk in the OSD servers is set up as an individual OSD, with a 12G journal created on the flash mirror.  I put the SSD servers into one CRUSH root and the SATA servers into another, and created pools using hosts as the failure domain, with the pools set for 2 copies.  I created the pools with pg_num and pgp_num set to 32x the number of OSDs in the pool.  On the frontends we create RBD devices and present them as iSCSI LUNs to clients via SCST - in this test case a Solaris host.
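
A sketch of the pool setup (bucket, rule and pool names here are illustrative rather than our exact ones; the PG counts just follow the 32x rule - 12 SSD OSDs gives 384, 18 SATA OSDs gives 576):

    ceph osd crush add-bucket ssd root
    ceph osd crush add-bucket sata root
    # ...move the SSD and SATA hosts under their respective roots...
    ceph osd crush rule create-simple ssd_rule ssd host
    ceph osd crush rule create-simple sata_rule sata host

    ceph osd pool create ssd_pool 384 384 replicated ssd_rule
    ceph osd pool set ssd_pool size 2
    ceph osd pool create sata_pool 576 576 replicated sata_rule
    ceph osd pool set sata_pool size 2

    # RBDs carved out of the pools and exported as iSCSI LUNs via SCST
    rbd create ssd_pool/lun01 --size 102400     # size in MB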

The problem I have is that even with a lightly loaded system the write service times for the LUNs are just not getting down to where we want them, and they are not very stable - with 5 LUNs doing around 200 32K IOPS consistently, the service times sit at around 3-4ms but regularly (every 20-30 seconds) spike above 12-15ms, which puts the average at 6ms over 5 minutes.  I fully expected some latency due to the distributed, networked nature of Ceph, but in this instance I just cannot find where it is coming from, especially with the SSD-based pool and flash-based journaling.

- The RBD devices show relatively low service times, but high queue times.  These are in line with what Solaris sees so I don't think SCST/iSCSI is adding much latency.
- The journals are reporting 0.02ms service times, and seem to cope fine with any bursts.
- The SSDs do show similar latency variations on writes - bursting up to 12ms or more whenever there is a heavy write workload.
- I have tried applying what tuning I can to the SSD block devices (noop scheduler, etc.) - no difference
- I have removed any sort of smarts around IO grouping in SCST - no major impact
- I have tried tuning the filestore queue and wbthrottle values but could not see much difference from that (the sketch after this list shows the sort of settings I mean).
- Read performance is excellent, the RBD devices show little to no rwait and I can do benchmarks up over 1GB/s in some tests.  Write throughput can also be good (~700MB/s).
- I have tried using different RBD orders more in line with the iSCSI client block sizes (i.e. 32K or 128K objects instead of 4M) but it seemed to make things worse.  I would have thought better alignment would reduce latency, but is that offset by the extra overhead of the additional object work?
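
For reference, the sorts of settings involved in the above attempts look like this (names and values are illustrative, not a recommendation and not necessarily exactly what I used):

    # noop scheduler on each SSD block device
    echo noop > /sys/block/sdb/queue/scheduler

    # filestore queue / wbthrottle knobs in ceph.conf (illustrative values)
    [osd]
    filestore queue max ops = 500
    filestore queue max bytes = 1048576000
    filestore wbthrottle xfs ios start flusher = 500
    filestore wbthrottle xfs ios hard limit = 5000
    filestore wbthrottle xfs bytes start flusher = 41943040
    filestore wbthrottle xfs bytes hard limit = 419430400

    # smaller RBD object sizes to better match the client block size
    rbd create ssd_pool/lun_test --size 102400 --order 15    # 2^15 = 32K objects
    rbd create ssd_pool/lun_test2 --size 102400 --order 17   # 2^17 = 128K objects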

What I am looking for is: what other areas do I need to look at, or what diagnostics do I need, to work this out?  We would really like to use Ceph across a mixed workload that includes some DB systems that are fairly latency-sensitive, but as it stands it's hard to be confident in the performance when a fairly quiet, unloaded system seems to struggle, even with all this hardware behind it.  I get the impression that the SSD write latencies might be coming into play, as they are similar to the numbers I see, but really for writes I would expect them to be "hidden" behind the journaling.
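
(For context, the stock diagnostics I am aware of are along these lines - default admin socket paths assumed - so pointers beyond these, or to the right counters within them, are what I am after:)

    # cluster-wide per-OSD commit/apply latency snapshot
    ceph osd perf

    # per-daemon counters and recent slow ops, run on the OSD host
    ceph daemon osd.0 perf dump
    ceph daemon osd.0 dump_historic_ops

    # client-side view of the rbd devices on the frontends
    iostat -x 1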

I also would have thought that, not being under load and with the flash journals, the only latency would be coming from mapping calculations on the client or from some contention within the RBD module itself.  Any ideas on how I can break out the time spent in what the RBD module is doing?

Any help appreciated.

As an aside - I think Ceph as a concept is exactly what a storage system should be about, which is why we are using it this way.  It's been awesome to get stuck into it and learn how it works and what it can do.




Adrian Saul | Infrastructure Projects Team Lead
TPG Telecom (ASX: TPM)