You can also dump the historic ops from the OSD admin socket. It will give a brief overview of each step in an op and how long each one is taking. But generally what you are seeing is not unusual: currently the best case for an RBD on a replicated pool will be somewhere between 200 and 500 IOPS for a single stream of synchronous writes. The Ceph code path is a lot more complex than a 30cm SAS cable.

CPU speed (i.e. GHz, not core count) is a large factor in write latency. You may find that you can improve performance by setting the maximum C-state to 1 and enabling idle=poll, which stops the cores from entering power-saving states. I have found that on systems with a large number of cores, unless you drive the whole box really hard, a lot of the cores clock themselves down, which hurts latency. Also disable all logging in your ceph.conf; this can have quite a big effect as well.

> -----Original Message-----
> From: ceph-users [mailto:ceph-users-bounces@xxxxxxxxxxxxxx] On Behalf Of Jan Schermer
> Sent: 03 March 2016 14:38
> To: RDS <rs350z@xxxxxx>
> Cc: ceph-users@xxxxxxxxxxxxxx
> Subject: Re: Ceph RBD latencies
>
> I think the latency comes from journal flushing.
>
> Try tuning
>
>     filestore min sync interval = .1
>     filestore max sync interval = 5
>
> and also
>
>     /proc/sys/vm/dirty_bytes (I suggest 512MB)
>     /proc/sys/vm/dirty_background_bytes (I suggest 256MB)
>
> See if that helps.
>
> It would be useful to see the job you are running to know what exactly it does. I'm afraid your latency is not really that bad; it will scale horizontally (with the number of clients) rather than vertically (higher IOPS for single blocking writes), and there's not much that can be done about that.
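For reference, the suggestions above translate into roughly the following. This is only a sketch: the OSD id, the ceph.conf sections and the particular debug subsystems shown are illustrative, not a complete or authoritative list.

    # Per-op step timings from an OSD's admin socket (osd.0 as an example)
    ceph daemon osd.0 dump_historic_ops

    # ceph.conf [osd] section - journal/filestore flush tuning as suggested above
    filestore min sync interval = .1
    filestore max sync interval = 5

    # Kernel dirty-page limits (512MB / 256MB, as suggested above)
    sysctl -w vm.dirty_bytes=536870912
    sysctl -w vm.dirty_background_bytes=268435456

    # ceph.conf [global] section - examples of turning debug logging right down
    debug ms = 0/0
    debug osd = 0/0
    debug filestore = 0/0
    debug journal = 0/0

    # Kernel command line additions for the C-state/idle suggestion
    intel_idle.max_cstate=1 processor.max_cstate=1 idle=poll

dump_historic_ops prints a timestamped event list for the slowest recent ops, which makes it easier to see where the time goes - queueing, the journal write, or waiting for replica sub-ops.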
> > On 03 Mar 2016, at 14:33, RDS <rs350z@xxxxxx> wrote:
> >
> > A couple of suggestions:
> > 1) The number of PGs per OSD should be 100-200.
> > 2) When dealing with SSD or flash, performance of these devices hinges on how you partition them and how you tune Linux:
> >    a) If using partitions, did you align the partitions on a 4k boundary? I start at sector 2048 using either fdisk or sfdisk.
>
> On SSDs you should align at an 8MB boundary (usually the erase block is quite large, though it doesn't matter that much), and the write block size is actually something like 128k. Sector 2048 aligns at 1MB, which is completely fine.
>
> >    b) There are quite a few Linux settings that benefit SSD/flash, namely: the deadline I/O scheduler (only when also using the deadline-related settings), raising the queue depth to 512 or 1024, setting rq_affinity=2 if the OS allows it, setting read-ahead if doing mostly reads, and others.
>
> Those don't matter that much; higher queue depths mean larger throughput, but at the expense of latency. The defaults are usually fine.
>
> >    3) Mount options: noatime, delaylog, inode64, noquota, etc.
>
> The defaults work fine (noatime is a relic; relatime is what filesystems use by default nowadays).
>
> > I have written some papers/blogs on this subject if you are interested in seeing them.
> >
> > Rick
> >
> >> On Mar 3, 2016, at 2:41 AM, Adrian Saul <Adrian.Saul@xxxxxxxxxxxxxxxxx> wrote:
> >>
> >> Hi Ceph-users,
> >>
> >> TL;DR - I can't seem to pin down why an unloaded system with flash-based OSD journals has higher than desired write latencies for RBD devices. Any ideas?
> >>
> >> I am developing a storage system based on Ceph and an SCST+Pacemaker cluster. Our initial testing showed promising results even with the mixed hardware we had available, and we proceeded to order a more deliberately designed platform to develop into production.
> >>
> >> The hardware is:
> >>
> >> 2x 1RU servers as "frontends" (SCST+Pacemaker - Ceph mons and clients using RBD - they present iSCSI to other systems)
> >> 3x 2RU OSD SSD servers (24-bay 2.5" SSD) - currently with 4x 2TB Samsung EVO SSDs each
> >> 3x 4RU OSD SATA servers (36-bay) - currently with 6x 8TB Seagate drives each
> >>
> >> As part of the research and planning we opted to put a pair of Intel DC P3700 400G NVMe cards in each OSD server. These are configured as a mirror and set up as the journals for the OSD disks, the aim being to improve write latencies. All the machines have 128G RAM and dual E5-2630v3 CPUs, and use 4 aggregated 10G NICs back to a common pair of switches. All machines run CentOS 7, with the frontends using the 4.4.1 elrepo-ml kernel to get a later RBD kernel module.
> >>
> >> On the Ceph side, each disk in the OSD servers is set up as an individual OSD, with a 12G journal created on the flash mirror. I put the SSD servers into one root and the SATA servers into another, and created pools using hosts as fault boundaries, with the pools set for 2 copies. I created the pools with pg_num and pgp_num set to 32x the number of OSDs in the pool. On the frontends we create RBD devices and present them as iSCSI LUNs to clients using SCST - in this test case a Solaris host.
> >>
> >> The problem I have is that even with a lightly loaded system the write service times for the LUNs are just not getting down to where we want them, and they are not very stable - with 5 LUNs doing around 200 32K IOPS consistently, the service times sit at around 3-4ms, but regularly (every 20-30 seconds) spike to above 12-15ms, which puts the average at 6ms over 5 minutes. I fully expected we would have some latency due to the distributed and networked nature of Ceph, but in this instance I just cannot find where these latencies are coming from, especially with the SSD-based pool and flash-based journaling.
> >>
> >> - The RBD devices show relatively low service times, but high queue times. These are in line with what Solaris sees, so I don't think SCST/iSCSI is adding much latency.
> >> - The journals are reporting 0.02ms service times, and seem to cope fine with any bursts.
> >> - The SSDs do show similar latency variations with writes - bursting up to 12ms or more whenever there is a high write workload.
> >> - I have tried applying what tuning I can to the SSD block devices (noop scheduler etc.) - no difference.
> >> - I have removed any sort of smarts around IO grouping in SCST - no major impact.
> >> - I have tried tuning up the filestore queue and wbthrottle values, but could not find much difference from that.
> >> - Read performance is excellent: the RBD devices show little to no rwait, and I can do benchmarks up over 1GB/s in some tests. Write throughput can also be good (~700MB/s).
> >> - I have tried using different RBD orders more in line with the iSCSI client block sizes (i.e. 32K or 128K instead of 4M), but it seemed to make things worse. I would have thought better alignment would reduce latency, but is that offset by the extra overhead in object work?
> >>
> >> What I am looking for is: what other areas do I need to look at, or what diagnostics do I need, to work this out?
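As a concrete starting point for the "what diagnostics" question above, a rough sketch (the pool name and OSD id are illustrative):

    # Commit/apply latency for every OSD, as reported by the cluster
    ceph osd perf

    # Internal OSD perf counters, including op and journal latency figures
    ceph daemon osd.0 perf dump

    # Single-threaded 4k writes straight into a pool, bypassing iSCSI/SCST/RBD,
    # to separate RADOS write latency from the client stack
    rados bench -p rbd 30 write -b 4096 -t 1

If a queue-depth-1 rados bench shows the same few milliseconds per write, the time is being spent in the OSD write path itself rather than in SCST, the RBD client or the network.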
> >> We would really like to use Ceph across a mixed workload that includes some DB systems that are fairly latency-sensitive, but as it stands it's hard to be confident in the performance when a fairly quiet, unloaded system seems to struggle, even with all this hardware behind it. I get the impression that the SSD write latencies might be coming into play, as they are similar to the numbers I see, but really for writes I would expect them to be "hidden" behind the journaling.
> >>
> >> I also would have thought that, not being under load and with the flash journals, the only latency would be coming from mapping calculations on the client or otherwise some contention within the RBD module itself. Any ideas how I can break out what the times are for the work the RBD module is doing?
> >>
> >> Any help appreciated.
> >>
> >> As an aside - I think Ceph as a concept is exactly what a storage system should be about, hence why we are using it this way. It's been awesome to get stuck into it and learn how it works and what it can do.
> >>
> >> Adrian Saul | Infrastructure Projects Team Lead
> >> TPG Telecom (ASX: TPM)
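On the point above about the SSD write latencies possibly showing through: a quick way to check whether the data SSDs themselves handle synchronous, flush-heavy writes poorly (a common weakness of consumer drives) is a single-threaded fio run against one of them. This is only a sketch - the device name is illustrative, and writing to a raw device destroys any data on it:

    fio --name=sync-write-test --filename=/dev/sdX --direct=1 --sync=1 \
        --rw=write --bs=4k --iodepth=1 --numjobs=1 --runtime=30 --time_based

If per-write latency there is already in the multi-millisecond range, that would point at the drives rather than the Ceph write path as the source of the periodic spikes.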