Hello,

On Thu, 3 Mar 2016 07:41:09 +0000 Adrian Saul wrote:

> Hi Ceph-users,
>
> TL;DR - I can't seem to pin down why an unloaded system with flash based
> OSD journals has higher than desired write latencies for RBD devices.
> Any ideas?
>
> I am developing a storage system based on Ceph and an SCST+pacemaker
> cluster. Our initial testing showed promising results even with mixed
> available hardware, and we proceeded to order a more designed platform
> for developing into production. The hardware is:
>
> 2x 1RU servers as "frontends" (SCST+pacemaker - ceph mons and clients
>    using RBD - they present iSCSI to other systems)
> 3x 2RU OSD SSD servers (24 bay 2.5" SSD) - currently with 4x 2TB Samsung
>    Evo SSDs each
> 3x 4RU OSD SATA servers (36 bay) - currently with 6x 8TB Seagate drives each
>
Samsung EVO... Which exact model? I presume this is not a DC one.
If you had put your journals on those, you would already be pulling your
hair out due to abysmal performance. Also with the EVOs I'd be worried
about endurance.

> As part of the research and planning we opted to put a pair of Intel
> PC3700DC 400G NVME cards in each OSD server. These are configured
> mirrored and set up as the journals for the OSD disks, the aim being to
> improve write latencies. All the machines have 128G RAM and dual
> E5-2630v3 CPUs, and use 4 aggregated 10G NICs back to a common pair of
> switches. All machines are running CentOS 7, with the frontends using
> the 4.4.1 elrepo-ml kernel to get a later RBD kernel module.
>
> On the ceph side each disk in the OSD servers is set up as an individual
> OSD, with a 12G journal created on the flash mirror. I set up the SSD
> servers into one root and the SATA servers into another, and created
> pools using hosts as fault boundaries, with the pools set for 2 copies.
>
Risky. If you have very reliable and well monitored SSDs you can get away
with 2 (I do so), but with HDDs the combination of their reliability and
recovery time is asking for trouble.
I realize that this is a testbed, but if your production has a replication
of 3 you will be disappointed by the additional latency.

> I created the pools with the pg_num and pgp_num set to 32x the number of
> OSDs in the pool. On the frontends we create RBD devices and present
> them as iSCSI LUNs using SCST to clients - in this test case a Solaris
> host.
>
> The problem I have is that even with a lightly loaded system the service
> times for the LUNs for writes are just not getting down to where we want
> them, and they are not very stable - with 5 LUNs doing around 200 32K
> IOPS consistently, the service times sit at around 3-4ms, but regularly
> (every 20-30 seconds) spike to above 12-15ms, which puts the average at
> 6ms over 5 minutes.
>
This smells like garbage collection on your SSDs, especially since it
matches time-wise what you saw on them below.

> I fully expected we would have some latencies due to the distributed and
> networked nature of Ceph, but in this instance I just cannot find where
> these latencies are coming from, especially with the SSD based pool and
> having flash based journaling.
>
> - The RBD devices show relatively low service times, but high queue
>   times. These are in line with what Solaris sees, so I don't think
>   SCST/iSCSI is adding much latency.
> - The journals are reporting 0.02ms service times, and seem to cope fine
>   with any bursts.
> - The SSDs do show similar latency variations with writes - bursting up
>   to 12ms or more whenever there is a high write workload.
>
This.
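If you want to confirm that theory, a sustained direct sync-write test
against one of the EVOs usually makes it obvious - it shows both how the
drive copes with sync writes and whether latency spikes appear at regular
intervals once garbage collection kicks in. Something along these lines
(the device path is just a placeholder, and note this writes to the drive,
so use a spare or otherwise expendable one); watch the completion-latency
percentiles in the output:

    fio --filename=/dev/sdX --direct=1 --sync=1 --rw=write --bs=4k \
        --numjobs=1 --iodepth=1 --runtime=60 --time_based \
        --group_reporting --name=journal-test

Consumer SSDs without power-loss protection tend to fall apart on exactly
this kind of workload, and periodic spikes in the high percentiles line up
with what you are seeing every 20-30 seconds.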
Have you tried the HDD based pool, and did you see similar, consistent
interval spikes? Or alternatively, have you configured 2 of your NVMes as
OSDs?

As for monitoring, I like atop for instant feedback. For more in-depth
analysis (and for when you're not watching), collectd with graphite serve
me well.

> - I have tried applying what tuning I can to the SSD block devices (noop
>   scheduler etc) - no difference.
> - I have removed any sort of smarts around IO grouping in SCST - no
>   major impact.
> - I have tried tuning up filestore queue and wbthrottle values but could
>   not find much difference from that.
> - Read performance is excellent, the RBD devices show little to no rwait
>   and I can do benchmarks up over 1GB/s in some tests. Write throughput
>   can also be good (~700MB/s).
> - I have tried using different RBD orders more in line with the iSCSI
>   client block sizes (i.e. 32K, 128K instead of 4M) but it seemed to make
>   things worse. I would have thought better alignment would reduce
>   latency, but is that offset by the extra overhead in object work?
>
> What I am looking for is what other areas do I need to look at, or what
> diagnostics do I need, to work this out? We would really like to use
> ceph across a mixed workload that includes some DB systems that are
> fairly latency sensitive, but as it stands it's hard to be confident in
> the performance when a fairly quiet, unloaded system seems to struggle,
> even with all this hardware behind it. I get the impression that the
> SSD write latencies might be coming into play as they are similar to the
> numbers I see, but really for writes I would expect them to be "hidden"
> behind the journaling.
>
No, not really. The journal can only buffer so much; there are several
threads about this in the archives. You could tune it (an example of the
relevant ceph.conf knobs is at the end of this mail, after the quoted
text), but that will only go so far if your backing storage can't keep up.

Regards,

Christian

> I also would have thought that, being not under load and with the flash
> journals, the only latency would be coming from mapping calculations on
> the client or otherwise some contention within the RBD module itself.
> Any ideas how I can break out what the times are for what the RBD module
> is doing?
>
> Any help appreciated.
>
> As an aside - I think Ceph as a concept is exactly what a storage system
> should be about, hence why we are using it this way. It's been awesome
> to get stuck into it and learn how it works and what it can do.
>
>
> Adrian Saul | Infrastructure Projects Team Lead
> TPG Telecom (ASX: TPM)
>
>
> Confidentiality: This email and any attachments are confidential and may
> be subject to copyright, legal or some other professional privilege.
> They are intended solely for the attention and use of the named
> addressee(s). They may only be copied, distributed or disclosed with the
> consent of the copyright owner. If you have received this email by
> mistake or by breach of the confidentiality clause, please notify the
> sender immediately by return email and delete or destroy all copies of
> the email. Any confidentiality, privilege or copyright is not waived or
> lost because this email has been sent to you by mistake.
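For reference, the filestore and journal knobs mentioned above all live in
the [osd] section of ceph.conf. A sketch of the kind of settings involved
is below; the values are purely illustrative, defaults differ between
releases, so check your running config before changing anything:

    [osd]
    # queueing between the OSD and the filestore (example values only)
    filestore queue max ops = 500
    filestore queue max bytes = 104857600
    # how often dirty filestore data gets synced to the backing disks
    filestore min sync interval = 0.01
    filestore max sync interval = 5
    # writeback throttle (XFS backends) - raising these delays flushing
    filestore wbthrottle xfs ios start flusher = 500
    filestore wbthrottle xfs ios hard limit = 5000
    filestore wbthrottle xfs bytes start flusher = 41943040
    filestore wbthrottle xfs bytes hard limit = 419430400
    # journal-side queueing
    journal max write bytes = 10485760
    journal max write entries = 1000
    journal queue max ops = 3000
    journal queue max bytes = 104857600

Raising the queue and wbthrottle limits lets the journal absorb larger
bursts, but as noted above that only postpones the point where the backing
SSDs have to keep up.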
--
Christian Balzer        Network/Systems Engineer
chibi@xxxxxxx           Global OnLine Japan/Rakuten Communications
http://www.gol.com/