You can also dump the historic ops from the OSD admin socket. It will give a brief overview of each step in an op and how long each one is taking. But generally what you are seeing is not unusual: currently the best case for an RBD on a replicated pool will be somewhere between 200 and 500 IOPS for a single stream of synchronous writes. The Ceph code path is a lot more complex than a 30cm SAS cable.

CPU speed (i.e. GHz, not core count) is a large factor in write latency. You may find that you can improve performance by setting the maximum C-state to 1 and enabling idle=poll, which stops the cores from entering power-saving states. I have found that on systems with a large number of cores, unless you drive the whole box really hard, a lot of the cores clock themselves down, which hurts latency. Also disable all logging in your ceph.conf; this can have quite a big effect as well.

> -----Original Message-----
> From: ceph-users [mailto:ceph-users-bounces@xxxxxxxxxxxxxx] On Behalf Of Jan Schermer
> Sent: 03 March 2016 14:38
> To: RDS <rs350z@xxxxxx>
> Cc: ceph-users@xxxxxxxxxxxxxx
> Subject: Re: Ceph RBD latencies
>
> I think the latency comes from journal flushing.
>
> Try tuning
>
>     filestore min sync interval = .1
>     filestore max sync interval = 5
>
> and also
>
>     /proc/sys/vm/dirty_bytes (I suggest 512MB)
>     /proc/sys/vm/dirty_background_bytes (I suggest 256MB)
>
> See if that helps.
>
> It would be useful to see the job you are running to know what exactly it does. I'm afraid your latency is not really that bad; it will scale horizontally (with the number of clients) rather than vertically (higher IOPS for single blocking writes), and there's not much that can be done about that.
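For reference, the suggestions above translate into roughly the following. This is only a sketch: the OSD id, the ceph.conf sections and the particular debug subsystems shown are illustrative, not a complete or authoritative list.

    # Per-op step timings from an OSD's admin socket (osd.0 as an example)
    ceph daemon osd.0 dump_historic_ops

    # ceph.conf [osd] section - journal/filestore flush tuning as suggested above
    filestore min sync interval = .1
    filestore max sync interval = 5

    # Kernel dirty-page limits (512MB / 256MB, as suggested above)
    sysctl -w vm.dirty_bytes=536870912
    sysctl -w vm.dirty_background_bytes=268435456

    # ceph.conf [global] section - examples of turning debug logging right down
    debug ms = 0/0
    debug osd = 0/0
    debug filestore = 0/0
    debug journal = 0/0

    # Kernel command line additions for the C-state/idle suggestion
    intel_idle.max_cstate=1 processor.max_cstate=1 idle=poll

dump_historic_ops prints a timestamped event list for the slowest recent ops, which makes it easier to see where the time goes - queueing, the journal write, or waiting for replica sub-ops.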
> > On 03 Mar 2016, at 14:33, RDS <rs350z@xxxxxx> wrote:
> >
> > A couple of suggestions:
> > 1) The number of PGs per OSD should be 100-200.
> > 2) When dealing with SSD or flash, performance of these devices hinges on how you partition them and how you tune Linux:
> >    a) If using partitions, did you align the partitions on a 4k boundary? I start at sector 2048 using either fdisk or sfdisk.
>
> On SSDs you should align at an 8MB boundary (usually the erase block is quite large, though it doesn't matter that much), and the write block size is actually something like 128k. Sector 2048 aligns at 1MB, which is completely fine.
>
> >    b) There are quite a few Linux settings that benefit SSD/flash, namely: the deadline I/O scheduler (only when also using the deadline-related settings), raising the queue depth to 512 or 1024, setting rq_affinity=2 if the OS allows it, setting read-ahead if doing mostly reads, and others.
>
> Those don't matter that much; higher queue depths mean larger throughput, but at the expense of latency. The defaults are usually fine.
>
> >    3) Mount options: noatime, delaylog, inode64, noquota, etc.
>
> The defaults work fine (noatime is a relic; relatime is what filesystems use by default nowadays).
>
> > I have written some papers/blogs on this subject if you are interested in seeing them.
> >
> > Rick
> >
> >> On Mar 3, 2016, at 2:41 AM, Adrian Saul <Adrian.Saul@xxxxxxxxxxxxxxxxx> wrote:
> >>
> >> Hi Ceph-users,
> >>
> >> TL;DR - I can't seem to pin down why an unloaded system with flash-based OSD journals has higher than desired write latencies for RBD devices. Any ideas?
> >>
> >> I am developing a storage system based on Ceph and an SCST+Pacemaker cluster. Our initial testing showed promising results even with the mixed hardware we had available, and we proceeded to order a more deliberately designed platform to develop into production.
> >>
> >> The hardware is:
> >>
> >> 2x 1RU servers as "frontends" (SCST+Pacemaker - Ceph mons and clients using RBD - they present iSCSI to other systems)
> >> 3x 2RU OSD SSD servers (24-bay 2.5" SSD) - currently with 4x 2TB Samsung EVO SSDs each
> >> 3x 4RU OSD SATA servers (36-bay) - currently with 6x 8TB Seagate drives each
> >>
> >> As part of the research and planning we opted to put a pair of Intel DC P3700 400G NVMe cards in each OSD server. These are configured as a mirror and set up as the journals for the OSD disks, the aim being to improve write latencies. All the machines have 128G RAM and dual E5-2630v3 CPUs, and use 4 aggregated 10G NICs back to a common pair of switches. All machines run CentOS 7, with the frontends using the 4.4.1 elrepo-ml kernel to get a later RBD kernel module.
> >>
> >> On the Ceph side, each disk in the OSD servers is set up as an individual OSD, with a 12G journal created on the flash mirror. I put the SSD servers into one root and the SATA servers into another, and created pools using hosts as fault boundaries, with the pools set for 2 copies. I created the pools with pg_num and pgp_num set to 32x the number of OSDs in the pool. On the frontends we create RBD devices and present them as iSCSI LUNs to clients using SCST - in this test case a Solaris host.
> >>
> >> The problem I have is that even with a lightly loaded system the write service times for the LUNs are just not getting down to where we want them, and they are not very stable - with 5 LUNs doing around 200 32K IOPS consistently, the service times sit at around 3-4ms, but regularly (every 20-30 seconds) spike to above 12-15ms, which puts the average at 6ms over 5 minutes. I fully expected we would have some latency due to the distributed and networked nature of Ceph, but in this instance I just cannot find where these latencies are coming from, especially with the SSD-based pool and flash-based journaling.
> >>
> >> - The RBD devices show relatively low service times, but high queue times. These are in line with what Solaris sees, so I don't think SCST/iSCSI is adding much latency.
> >> - The journals are reporting 0.02ms service times, and seem to cope fine with any bursts.
> >> - The SSDs do show similar latency variations with writes - bursting up to 12ms or more whenever there is a high write workload.
> >> - I have tried applying what tuning I can to the SSD block devices (noop scheduler etc.) - no difference.
> >> - I have removed any sort of smarts around IO grouping in SCST - no major impact.
> >> - I have tried tuning up the filestore queue and wbthrottle values, but could not find much difference from that.
> >> - Read performance is excellent: the RBD devices show little to no rwait, and I can do benchmarks up over 1GB/s in some tests. Write throughput can also be good (~700MB/s).
> >> - I have tried using different RBD orders more in line with the iSCSI client block sizes (i.e. 32K or 128K instead of 4M), but it seemed to make things worse. I would have thought better alignment would reduce latency, but is that offset by the extra overhead in object work?
> >>
> >> What I am looking for is: what other areas do I need to look at, or what diagnostics do I need, to work this out?
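As a concrete starting point for the "what diagnostics" question above, a rough sketch (the pool name and OSD id are illustrative):

    # Commit/apply latency for every OSD, as reported by the cluster
    ceph osd perf

    # Internal OSD perf counters, including op and journal latency figures
    ceph daemon osd.0 perf dump

    # Single-threaded 4k writes straight into a pool, bypassing iSCSI/SCST/RBD,
    # to separate RADOS write latency from the client stack
    rados bench -p rbd 30 write -b 4096 -t 1

If a queue-depth-1 rados bench shows the same few milliseconds per write, the time is being spent in the OSD write path itself rather than in SCST, the RBD client or the network.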
> >> We would really like to use Ceph across a mixed workload that includes some DB systems that are fairly latency-sensitive, but as it stands it's hard to be confident in the performance when a fairly quiet, unloaded system seems to struggle, even with all this hardware behind it. I get the impression that the SSD write latencies might be coming into play, as they are similar to the numbers I see, but really for writes I would expect them to be "hidden" behind the journaling.
> >>
> >> I also would have thought that, not being under load and with the flash journals, the only latency would be coming from mapping calculations on the client or otherwise some contention within the RBD module itself. Any ideas how I can break out what the times are for the work the RBD module is doing?
> >>
> >> Any help appreciated.
> >>
> >> As an aside - I think Ceph as a concept is exactly what a storage system should be about, hence why we are using it this way. It's been awesome to get stuck into it and learn how it works and what it can do.
> >>
> >> Adrian Saul | Infrastructure Projects Team Lead
> >> TPG Telecom (ASX: TPM)
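On the point above about the SSD write latencies possibly showing through: a quick way to check whether the data SSDs themselves handle synchronous, flush-heavy writes poorly (a common weakness of consumer drives) is a single-threaded fio run against one of them. This is only a sketch - the device name is illustrative, and writing to a raw device destroys any data on it:

    fio --name=sync-write-test --filename=/dev/sdX --direct=1 --sync=1 \
        --rw=write --bs=4k --iodepth=1 --numjobs=1 --runtime=30 --time_based

If per-write latency there is already in the multi-millisecond range, that would point at the drives rather than the Ceph write path as the source of the periodic spikes.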