Hey Kenneth, it looks like you're just down the toll road from me. I'm in Reston Town Center.

Just as a really rough estimate, I'd say this is your max IOPS:

    80 IOPS/spinner * 6 drives / 3 replicas = 160ish max sustained IOPS

It's more complicated than that, since you have a reasonable solid state journal, lots of memory, etc., but that's a guess, since the backend will eventually need to keep up.

That being said, almost every time I have seen blocked requests, there is some other underlying issue. I would say start with implementation checks:

- checking connectivity between OSDs, with and without LACP (overkill for your purposes)
- ensuring that the OSDs' target drives are actually mounted instead of scribbling to the root drive
- ensuring that the journal is properly implemented
- all OSDs on the same version
- any OSDs crashing?
- packet fragmentation? We have to stick with 1500 MTU to prevent frags. Don't assume you can run jumbo frames.
- you're not running much traffic, so a short capture on both sides and Wireshark should reveal any obvious issues

Is there anything in the ceph.log from a mon host? Grep for WRN. Also look at the individual OSD logs.

This seems more like an implementation issue. Happy to help out a local if you need more.

--
Warren Wang
Comcast Cloud (OpenStack)
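A rough sketch of how the checks above might be run on each storage node -- this assumes the default /var/lib/ceph layout, stock Linux tools, and the hammer-era ceph CLI, so the OSD IDs, interface name, log paths, and peer address are illustrative and should be adjusted to the actual deployment:

    # all OSDs up/in and running the same version?
    ceph osd tree
    ceph tell osd.* version

    # OSD data directories on the intended drives, not the root filesystem
    df -h /var/lib/ceph/osd/ceph-*

    # journal pointing at the SSD partition (FileStore symlink, if present;
    # this cluster sets "osd journal" in ceph.conf instead)
    ls -l /var/lib/ceph/osd/ceph-*/journal

    # 1472 bytes of ICMP payload + 28 bytes of headers = 1500; -M do forbids fragmentation
    ping -M do -s 1472 -c 3 10.0.0.2

    # blocked/slow request evidence: cluster log on a mon host, then the individual OSD logs
    grep WRN /var/log/ceph/ceph.log | tail -50
    grep -i 'slow request' /var/log/ceph/ceph-osd.*.log | tail -50

    # short capture of OSD traffic for Wireshark (interface name is a placeholder)
    tcpdump -ni eth0 -w /tmp/osd-traffic.pcap 'tcp portrange 6800-7300'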
On 8/31/15, 1:28 PM, "ceph-users on behalf of Kenneth Van Alstyne" <ceph-users-bounces@xxxxxxxxxxxxxx on behalf of kvanalstyne@xxxxxxxxxxxxxxx> wrote:

>Christian, et al:
>
>Sorry for the lack of information. I wasn't sure which of our hardware specifications or Ceph configuration details would be useful at this point. Thanks for the feedback -- any feedback is appreciated at this point, as I've been beating my head against a wall trying to figure out what's going on. (If anything. Maybe the spindle count is indeed our upper limit, or our SSDs really suck? :-) )
>
>To directly address your questions, see the answers below:
> - CBT is the Ceph Benchmarking Tool. Since my question was more generic rather than about CBT itself, it seemed more useful to post to the ceph-users list rather than to cbt.
> - The 8 cores come from 2x quad-core Intel(R) Xeon(R) CPU E5-2609 0 @ 2.40GHz.
> - The SSDs are indeed Intel S3500s. I agree -- not ideal, but supposedly capable of up to 75,000 random 4KB reads/writes. Throughput and longevity are quite low for an SSD, though, rated at about 400MB/s reads and 100MB/s writes. When we added these as journals in front of the SATA spindles, both VM performance and rados benchmark numbers were relatively unchanged.
> - Regarding throughput vs. IOPS: indeed, the throughput I'm seeing is nearly the worst-case scenario, with all I/O being 4KB block size. With RBD cache enabled and the writeback option set in the VM configuration, I was hoping more coalescing would occur, increasing the I/O block size.
>
>As an aside, the orchestration layer on top of KVM is OpenNebula, if that's of any interest.
>
>VM information:
> - Number = 15
> - Workload = Mixed (I know, I know -- that's as vague an answer as they come.) A handful of VMs are running some MySQL databases and some web applications in Apache Tomcat. One is running a syslog server. Everything else is mostly static web page serving for a low number of users.
>
>I can duplicate the blocked request issue pretty consistently, just by running something simple like a "yum -y update" in one VM. While that is running, ceph -w and ceph -s show the following:
>
>root@dashboard:~# ceph -s
>    cluster f79d8c2a-3c14-49be-942d-83fc5f193a25
>     health HEALTH_WARN
>            1 requests are blocked > 32 sec
>     monmap e3: 3 mons at {storage-1=10.0.0.1:6789/0,storage-2=10.0.0.2:6789/0,storage-3=10.0.0.3:6789/0}
>            election epoch 136, quorum 0,1,2 storage-1,storage-2,storage-3
>     osdmap e75590: 6 osds: 6 up, 6 in
>      pgmap v3495103: 224 pgs, 1 pools, 826 GB data, 225 kobjects
>            2700 GB used, 2870 GB / 5571 GB avail
>                 224 active+clean
>  client io 3292 B/s rd, 2623 kB/s wr, 81 op/s
>
>2015-08-31 16:39:46.490696 mon.0 [INF] pgmap v3495096: 224 pgs: 224 active+clean; 826 GB data, 2700 GB used, 2870 GB / 5571 GB avail
>2015-08-31 16:39:47.789982 mon.0 [INF] pgmap v3495097: 224 pgs: 224 active+clean; 826 GB data, 2700 GB used, 2870 GB / 5571 GB avail; 0 B/s rd, 517 kB/s wr, 130 op/s
>2015-08-31 16:39:49.239033 mon.0 [INF] pgmap v3495098: 224 pgs: 224 active+clean; 826 GB data, 2700 GB used, 2870 GB / 5571 GB avail; 0 B/s rd, 474 kB/s wr, 128 op/s
>2015-08-31 16:39:51.970679 mon.0 [INF] pgmap v3495099: 224 pgs: 224 active+clean; 826 GB data, 2700 GB used, 2870 GB / 5571 GB avail; 0 B/s rd, 58662 B/s wr, 22 op/s
>2015-08-31 16:39:57.267697 mon.0 [INF] pgmap v3495100: 224 pgs: 224 active+clean; 826 GB data, 2700 GB used, 2870 GB / 5571 GB avail; 11357 B/s wr, 5 op/s
>2015-08-31 16:39:58.700312 mon.0 [INF] pgmap v3495101: 224 pgs: 224 active+clean; 826 GB data, 2700 GB used, 2870 GB / 5571 GB avail; 1911 B/s rd, 701 kB/s wr, 19 op/s
>2015-08-31 16:39:59.999624 mon.0 [INF] pgmap v3495102: 224 pgs: 224 active+clean; 826 GB data, 2700 GB used, 2870 GB / 5571 GB avail; 4247 B/s rd, 3092 kB/s wr, 66 op/s
>2015-08-31 16:40:02.156758 mon.0 [INF] pgmap v3495103: 224 pgs: 224 active+clean; 826 GB data, 2700 GB used, 2870 GB / 5571 GB avail; 3292 B/s rd, 2623 kB/s wr, 81 op/s
>2015-08-31 16:40:03.289101 mon.0 [INF] pgmap v3495104: 224 pgs: 224 active+clean; 826 GB data, 2700 GB used, 2870 GB / 5571 GB avail; 65664 B/s rd, 2163 kB/s wr, 76 op/s
>2015-08-31 16:40:04.679926 mon.0 [INF] pgmap v3495105: 224 pgs: 224 active+clean; 826 GB data, 2700 GB used, 2870 GB / 5571 GB avail; 90075 B/s rd, 3158 kB/s wr, 34 op/s
>2015-08-31 16:40:07.237293 mon.0 [INF] pgmap v3495106: 224 pgs: 224 active+clean; 826 GB data, 2700 GB used, 2870 GB / 5571 GB avail; 0 B/s rd, 1899 kB/s wr, 29 op/s
>2015-08-31 16:40:08.303615 mon.0 [INF] pgmap v3495107: 224 pgs: 224 active+clean; 826 GB data, 2700 GB used, 2870 GB / 5571 GB avail; 259 kB/s rd, 2864 kB/s wr, 77 op/s
>2015-08-31 16:40:09.352817 mon.0 [INF] pgmap v3495108: 224 pgs: 224 active+clean; 826 GB data, 2700 GB used, 2870 GB / 5571 GB avail; 411 kB/s rd, 4093 kB/s wr, 115 op/s
>2015-08-31 16:40:11.951104 mon.0 [INF] pgmap v3495109: 224 pgs: 224 active+clean; 826 GB data, 2700 GB used, 2870 GB / 5571 GB avail; 466 kB/s rd, 1863 kB/s wr, 148 op/s
>
>I never seem to get anywhere near 300 op/s. If spindle count is indeed the problem, is there anything else I can do to improve caching or I/O coalescing to deal with my crippling IOPS limit due to the low number of spindles?
>
>Thanks,
>
>--
>Kenneth Van Alstyne
>Systems Architect
>Knight Point Systems, LLC
>Service-Disabled Veteran-Owned Business
>1775 Wiehle Avenue Suite 101 | Reston, VA 20190
>c: 228-547-8045 f: 571-266-3106
>www.knightpoint.com
>DHS EAGLE II Prime Contractor: FC1 SDVOSB Track
>GSA Schedule 70 SDVOSB: GS-35F-0646S
>GSA MOBIS Schedule: GS-10F-0404Y
>ISO 20000 / ISO 27001
>
>Notice: This e-mail message, including any attachments, is for the sole use of the intended recipient(s) and may contain confidential and privileged information. Any unauthorized review, copy, use, disclosure, or distribution is STRICTLY prohibited. If you are not the intended recipient, please contact the sender by reply e-mail and destroy all copies of the original message.
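On the caching and coalescing question above: the librbd cache is configured on the client (hypervisor) side, in the [client] section of ceph.conf. A minimal sketch follows -- the option names are the stock librbd ones, but the values are purely illustrative, not recommendations:

    [client]
        rbd cache = true
        # stay in writethrough mode until the guest issues its first flush (safety default)
        rbd cache writethrough until flush = true
        # per-volume cache size; the librbd default is 32 MB
        rbd cache size = 67108864
        # how much dirty data may accumulate before writeback starts
        rbd cache max dirty = 50331648
        # how long (in seconds) dirty data may sit before being flushed
        rbd cache max dirty age = 2

Note that the cache can only merge writes that are actually adjacent, so it smooths bursts but does not raise the underlying spindle IOPS ceiling discussed in this thread.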
>> On Aug 31, 2015, at 11:01 AM, Christian Balzer <chibi@xxxxxxx> wrote:
>>
>> Hello,
>>
>> On Mon, 31 Aug 2015 08:31:57 -0500 Kenneth Van Alstyne wrote:
>>
>>> Sorry about the repost from the cbt list, but it was suggested I post here as well:
>>>
>> I wasn't even aware a CBT (what the heck does that acronym stand for?) existed...
>>
>>> I am attempting to track down some performance issues in a Ceph cluster recently deployed. Our configuration is as follows: 3 storage nodes,
>> 3 nodes is, of course, the bare minimum.
>>
>>> each with:
>>> - 8 Cores
>> Of what, apples? Detailed information makes for better replies.
>>
>>> - 64GB of RAM
>> Ample.
>>
>>> - 2x 1TB 7200 RPM Spindle
>> Even if your cores were to be rotten apple ones, that's very few spindles, so your CPU is unlikely to be the bottleneck.
>>
>>> - 1x 120GB Intel SSD
>> Details, again. From your P.S. I conclude that these are S3500s, definitely not my choice for journals when it comes to speed and endurance.
>>
>>> - 2x 10GBit NICs (In LACP Port-channel)
>> Massively overspec'ed considering your storage sinks/wells aka HDDs.
>>
>>> The OSD pool min_size is set to "1" and "size" is set to "3". When creating a new pool and running RADOS benchmarks, performance isn't bad -- about what I would expect from this hardware configuration:
>>>
>> Rados bench uses 4MB "blocks" by default, which is the optimum size for (default) RBD pools.
>> Bandwidth does not equal IOPS (which are commonly measured in 4KB blocks).
>>
>>> WRITES:
>>> Total writes made:      207
>>> Write size:             4194304
>>> Bandwidth (MB/sec):     80.017
>>>
>>> Stddev Bandwidth:       34.9212
>>> Max bandwidth (MB/sec): 120
>>> Min bandwidth (MB/sec): 0
>>> Average Latency:        0.797667
>>> Stddev Latency:         0.313188
>>> Max latency:            1.72237
>>> Min latency:            0.253286
>>>
>>> RAND READS:
>>> Total time run:         10.127990
>>> Total reads made:       1263
>>> Read size:              4194304
>>> Bandwidth (MB/sec):     498.816
>>>
>>> Average Latency:        0.127821
>>> Max latency:            0.464181
>>> Min latency:            0.0220425
>>>
>>> This all looks fine, until we try to use the cluster for its purpose, which is to house images for qemu-kvm, which are accessed using librbd.
>> Not that it probably matters, but knowing if this is OpenStack, Ganeti or something else might be of interest.
>>
>>> I/O inside the VMs has excessive I/O wait times (in the hundreds of ms at times, making some operating systems, like Windows, unusable) and throughput struggles to exceed 10MB/s (or less). Looking at ceph health, we see very low op/s numbers as well as low throughput, and the requests-blocked count seems very high. Any ideas as to what to look at here?
>>>
>> Again, details.
>>
>> How many VMs?
>> What are they doing?
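To put an IOPS figure next to the 4MB bandwidth numbers quoted above, rados bench can be rerun with a 4KB block size. A sketch only: "scratch" is a hypothetical test pool name, the thread count is illustrative, and the cleanup step assumes a release that ships the cleanup subcommand:

    # small-block write test; --no-cleanup keeps the objects so the read phase has data
    rados -p scratch bench 30 write -b 4096 -t 16 --no-cleanup
    # random reads against the 4KB objects just written
    rados -p scratch bench 30 rand -t 16
    # remove the benchmark objects afterwards
    rados -p scratch cleanup

The op/s column this reports can then be compared directly against the spindle-based estimates in this thread, rather than inferring IOPS from 4MB bandwidth.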
>> Keep in mind that the BEST sustained result you could hope for here (ignoring Ceph overhead and network latency) is the IOPS of 2 HDDs, so about 300 IOPS at best. TOTAL.
>>
>>>     health HEALTH_WARN
>>>            8 requests are blocked > 32 sec
>>>     monmap e3: 3 mons at {storage-1=10.0.0.1:6789/0,storage-2=10.0.0.2:6789/0,storage-3=10.0.0.3:6789/0}
>>>            election epoch 128, quorum 0,1,2 storage-1,storage-2,storage-3
>>>     osdmap e69615: 6 osds: 6 up, 6 in
>>>      pgmap v3148541: 224 pgs, 1 pools, 819 GB
>> 256 or 512 PGs would have been the "correct" number here, but that's of little importance.
>>
>>> data, 227 kobjects
>>>            2726 GB used, 2844 GB / 5571 GB avail
>>>                 224 active+clean
>>>   client io 3957 B/s rd, 3494 kB/s wr, 30 op/s
>>>
>> That's a lot of data being written for a tiny cluster like yours.
>> Looking at your nodes with atop or similar tools will likely reveal that your HDDs are quite the busy beavers and can't keep up.
>>
>> Also, prolonged values from "ceph -w" might be educational.
>>
>> Regards,
>>
>> Christian
>>
>>> Of note, on the other list, I was asked to provide the following:
>>>  - ceph version 0.94.1 (e4bfad3a3c51054df7e537a724c8d0bf9be972ff)
>>>  - The SSD is split into 8GB partitions. These 8GB partitions are used as journal devices, specified in /etc/ceph/ceph.conf. For example:
>>>        [osd.0]
>>>          host = storage-1
>>>          osd journal = /dev/mapper/INTEL_SSDSC2BB120G4_CVWL4363006R120LGNp1
>>>  - rbd_cache is enabled and qemu cache is set to "writeback"
>>>  - rbd_concurrent_management_ops is unset, so it appears the default is "10"
>>>
>>> Thanks,
>>>
>>> --
>>> Kenneth Van Alstyne
>>
>> --
>> Christian Balzer           Network/Systems Engineer
>> chibi@xxxxxxx              Global OnLine Japan/Fusion Communications
>> http://www.gol.com/

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com