I think the latency comes from journal flushing. Try tuning

filestore min sync interval = 0.1
filestore max sync interval = 5

and also

/proc/sys/vm/dirty_bytes (I suggest 512MB)
/proc/sys/vm/dirty_background_bytes (I suggest 256MB)

See if that helps (there is a concrete snippet at the end of this message). It would be useful to see the job you are running, to know what exactly it does. I'm afraid your latency is not really that bad; it will scale horizontally (with the number of clients) rather than vertically (higher IOPS for single blocking writes), and there's not much that can be done about that.

> On 03 Mar 2016, at 14:33, RDS <rs350z@xxxxxx> wrote:
>
> A couple of suggestions:
> 1) # of PGs per OSD should be 100-200
> 2) When dealing with SSD or Flash, performance of these devices hinges on how you partition them and how you tune Linux:
>    a) If using partitions, did you align the partitions on a 4k boundary? I start at sector 2048 using either fdisk or sfdisk.

On SSD you should align at an 8MB boundary (the erase block is usually quite large, though it doesn't matter that much), and the write block size is actually something like 128k. Sector 2048 aligns at 1MB, which is completely fine.

>    b) There are quite a few Linux settings that benefit SSD/Flash: the deadline I/O scheduler (only when using the deadline-associated settings), upping the queue depth to 512 or 1024, setting rq_affinity=2 if the OS allows it, setting read-ahead if doing a majority of reads, and others.

Those don't matter that much; higher queue depths mean larger throughput but at the expense of latency. The defaults are usually fine.

> 3) Mount options: noatime, delaylog, inode64, noquota, etc…

The defaults work fine (noatime is a relic; relatime is what filesystems use by default nowadays).

>
> I have written some papers/blogs on this subject if you are interested in seeing them.
> Rick
>
>> On Mar 3, 2016, at 2:41 AM, Adrian Saul <Adrian.Saul@xxxxxxxxxxxxxxxxx> wrote:
>>
>> Hi Ceph-users,
>>
>> TL;DR - I can't seem to pin down why an unloaded system with flash-based OSD journals has higher than desired write latencies for RBD devices. Any ideas?
>>
>> I am developing a storage system based on Ceph and an SCST+pacemaker cluster. Our initial testing showed promising results even with mixed available hardware, and we proceeded to order a more deliberately designed platform for developing into production. The hardware is:
>>
>> 2x 1RU servers as "frontends" (SCST+pacemaker - Ceph mons and clients using RBD - they present iSCSI to other systems)
>> 3x 2RU OSD SSD servers (24-bay 2.5" SSD) - currently with 4x 2TB Samsung Evo SSDs each
>> 3x 4RU OSD SATA servers (36-bay) - currently with 6x 8TB Seagate drives each
>>
>> As part of the research and planning we opted to put a pair of Intel DC P3700 400G NVMe cards in each OSD server. These are configured mirrored and set up as the journals for the OSD disks, the aim being to improve write latencies. All the machines have 128G RAM and dual E5-2630v3 CPUs, and use 4 aggregated 10G NICs back to a common pair of switches. All machines are running CentOS 7, with the frontends using the 4.4.1 elrepo-ml kernel to get a later RBD kernel module.
>>
>> On the Ceph side, each disk in the OSD servers is set up as an individual OSD, with a 12G journal created on the flash mirror. I set up the SSD servers into one root and the SATA servers into another, and created pools using hosts as fault boundaries, with the pools set for 2 copies. I created the pools with pg_num and pgp_num set to 32x the number of OSDs in the pool.
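For reference, that sizing works out to something like this for the SSD pool (3 hosts x 4 SSDs = 12 OSDs; the pool name here is just a placeholder):

    # pg_num = pgp_num = 32 x 12 OSDs = 384
    ceph osd pool create rbd-ssd 384 384
    ceph osd pool set rbd-ssd size 2
    # PGs per OSD = pg_num x size / OSDs = 384 x 2 / 12 = 64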
>> On the frontends we create RBD devices and present them as iSCSI LUNs to clients using SCST - in this test case a Solaris host.
>>
>> The problem I have is that even with a lightly loaded system the service times for the LUNs for writes are just not getting down to where we want them, and they are not very stable - with 5 LUNs doing around 200 32K IOPS consistently, the service times sit at around 3-4ms, but regularly (every 20-30 seconds) jump up above 12-15ms, which puts the average at 6ms over 5 minutes. I fully expected we would have some latencies due to the distributed and networked nature of Ceph, but in this instance I just cannot find where these latencies are coming from, especially with the SSD-based pool and flash-based journaling.
>>
>> - The RBD devices show relatively low service times, but high queue times. These are in line with what Solaris sees, so I don't think SCST/iSCSI is adding much latency.
>> - The journals are reporting 0.02ms service times, and seem to cope fine with any bursts.
>> - The SSDs do show similar latency variations with writes - bursting up to 12ms or more whenever there are high write workloads.
>> - I have tried applying what tuning I can to the SSD block devices (noop scheduler etc.) - no difference.
>> - I have removed any sort of smarts around IO grouping in SCST - no major impact.
>> - I have tried tuning up the filestore queue and wbthrottle values but could not find much difference from that.
>> - Read performance is excellent; the RBD devices show little to no rwait and I can do benchmarks up over 1GB/s in some tests. Write throughput can also be good (~700MB/s).
>> - I have tried using different RBD orders more in line with the iSCSI client block sizes (i.e. 32K, 128K instead of 4M), but it seemed to make things worse. I would have thought better alignment would reduce latency, but is that offset by the extra overhead in object work?
>>
>> What I am looking for is: what other areas do I need to look at, or what diagnostics do I need, to work this out? We would really like to use Ceph across a mixed workload that includes some DB systems that are fairly latency sensitive, but as it stands it's hard to be confident in the performance when a fairly quiet, unloaded system seems to struggle, even with all this hardware behind it. I get the impression that the SSD write latencies might be coming into play, as they are similar to the numbers I see, but really for writes I would expect them to be "hidden" behind the journaling.
>>
>> I also would have thought that, being not under load and with the flash journals, the only latency would be coming from mapping calculations on the client or otherwise some contention within the RBD module itself. Any ideas how I can break out the times for what the RBD module is doing?
>>
>> Any help appreciated.
>>
>> As an aside - I think Ceph as a concept is exactly what a storage system should be about, hence why we are using it this way. It's been awesome to get stuck into it and learn how it works and what it can do.
>>
>> Adrian Saul | Infrastructure Projects Team Lead
>> TPG Telecom (ASX: TPM)
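To make the suggestion at the top of this mail concrete, it would look something like the following. The filestore settings go in the [osd] section of ceph.conf, and the vm.dirty_* sysctls take a byte count, so 512MB and 256MB become the values below:

    # ceph.conf ([osd] section)
    filestore min sync interval = 0.1
    filestore max sync interval = 5

    # /etc/sysctl.conf (values in bytes)
    vm.dirty_bytes = 536870912
    vm.dirty_background_bytes = 268435456

Note that setting the *_bytes variants makes the kernel ignore the corresponding *_ratio settings.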