Hey Kenneth, it looks like you're just down the toll road from me. I'm in Reston Town Center.

Just as a really rough estimate, I'd say this is your max IOPS:

    80 IOPS/spinner * 6 drives / 3 replicas = 160ish max sustained IOPS

It's more complicated than that, since you have a reasonable solid state journal, lots of memory, etc., but that's a guess, since the backend will eventually need to keep up.

That being said, almost every time I have seen blocked requests, there is some other underlying issue. I would say start with implementation checks:

- checking connectivity between OSDs, with and without LACP (overkill for your purposes)
- ensuring that the OSDs' target drives are actually mounted instead of scribbling to the root drive
- ensuring that the journal is properly implemented
- all OSDs on the same version
- any OSDs crashing?
- packet fragmentation? We have to stick with 1500 MTU to prevent frags. Don't assume you can run jumbo frames.
- you're not running much traffic, so a short capture on both sides and Wireshark should reveal any obvious issues

Is there anything in the ceph.log from a mon host? Grep for WRN. Also look at the individual OSD logs.

This seems more like an implementation issue. Happy to help out a local if you need more.

--
Warren Wang
Comcast Cloud (OpenStack)
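A rough sketch of how the checks above might be run on each storage node -- this assumes the default /var/lib/ceph layout, stock Linux tools, and the hammer-era ceph CLI, so the OSD IDs, interface name, log paths, and peer address are illustrative and should be adjusted to the actual deployment:

    # all OSDs up/in and running the same version?
    ceph osd tree
    ceph tell osd.* version

    # OSD data directories on the intended drives, not the root filesystem
    df -h /var/lib/ceph/osd/ceph-*

    # journal pointing at the SSD partition (FileStore symlink, if present;
    # this cluster sets "osd journal" in ceph.conf instead)
    ls -l /var/lib/ceph/osd/ceph-*/journal

    # 1472 bytes of ICMP payload + 28 bytes of headers = 1500; -M do forbids fragmentation
    ping -M do -s 1472 -c 3 10.0.0.2

    # blocked/slow request evidence: cluster log on a mon host, then the individual OSD logs
    grep WRN /var/log/ceph/ceph.log | tail -50
    grep -i 'slow request' /var/log/ceph/ceph-osd.*.log | tail -50

    # short capture of OSD traffic for Wireshark (interface name is a placeholder)
    tcpdump -ni eth0 -w /tmp/osd-traffic.pcap 'tcp portrange 6800-7300'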
On 8/31/15, 1:28 PM, "ceph-users on behalf of Kenneth Van Alstyne" <ceph-users-bounces@xxxxxxxxxxxxxx on behalf of kvanalstyne@xxxxxxxxxxxxxxx> wrote:

>Christian, et al:
>
>Sorry for the lack of information. I wasn't sure which of our hardware specifications or Ceph configuration details would be useful at this point. Thanks for the feedback -- any feedback is appreciated at this point, as I've been beating my head against a wall trying to figure out what's going on. (If anything. Maybe the spindle count is indeed our upper limit, or our SSDs really suck? :-) )
>
>To directly address your questions, see the answers below:
> - CBT is the Ceph Benchmarking Tool. Since my question was more generic rather than about CBT itself, it seemed more useful to post to the ceph-users list rather than to cbt.
> - The 8 cores come from 2x quad-core Intel(R) Xeon(R) CPU E5-2609 0 @ 2.40GHz.
> - The SSDs are indeed Intel S3500s. I agree -- not ideal, but supposedly capable of up to 75,000 random 4KB reads/writes. Throughput and longevity are quite low for an SSD, though, rated at about 400MB/s reads and 100MB/s writes. When we added these as journals in front of the SATA spindles, both VM performance and rados benchmark numbers were relatively unchanged.
> - Regarding throughput vs. IOPS: indeed, the throughput I'm seeing is nearly the worst-case scenario, with all I/O being 4KB block size. With RBD cache enabled and the writeback option set in the VM configuration, I was hoping more coalescing would occur, increasing the I/O block size.
>
>As an aside, the orchestration layer on top of KVM is OpenNebula, if that's of any interest.
>
>VM information:
> - Number = 15
> - Workload = Mixed (I know, I know -- that's as vague an answer as they come.) A handful of VMs are running some MySQL databases and some web applications in Apache Tomcat. One is running a syslog server. Everything else is mostly static web page serving for a low number of users.
>
>I can duplicate the blocked request issue pretty consistently, just by running something simple like a "yum -y update" in one VM. While that is running, ceph -w and ceph -s show the following:
>
>root@dashboard:~# ceph -s
>    cluster f79d8c2a-3c14-49be-942d-83fc5f193a25
>     health HEALTH_WARN
>            1 requests are blocked > 32 sec
>     monmap e3: 3 mons at {storage-1=10.0.0.1:6789/0,storage-2=10.0.0.2:6789/0,storage-3=10.0.0.3:6789/0}
>            election epoch 136, quorum 0,1,2 storage-1,storage-2,storage-3
>     osdmap e75590: 6 osds: 6 up, 6 in
>      pgmap v3495103: 224 pgs, 1 pools, 826 GB data, 225 kobjects
>            2700 GB used, 2870 GB / 5571 GB avail
>                 224 active+clean
>  client io 3292 B/s rd, 2623 kB/s wr, 81 op/s
>
>2015-08-31 16:39:46.490696 mon.0 [INF] pgmap v3495096: 224 pgs: 224 active+clean; 826 GB data, 2700 GB used, 2870 GB / 5571 GB avail
>2015-08-31 16:39:47.789982 mon.0 [INF] pgmap v3495097: 224 pgs: 224 active+clean; 826 GB data, 2700 GB used, 2870 GB / 5571 GB avail; 0 B/s rd, 517 kB/s wr, 130 op/s
>2015-08-31 16:39:49.239033 mon.0 [INF] pgmap v3495098: 224 pgs: 224 active+clean; 826 GB data, 2700 GB used, 2870 GB / 5571 GB avail; 0 B/s rd, 474 kB/s wr, 128 op/s
>2015-08-31 16:39:51.970679 mon.0 [INF] pgmap v3495099: 224 pgs: 224 active+clean; 826 GB data, 2700 GB used, 2870 GB / 5571 GB avail; 0 B/s rd, 58662 B/s wr, 22 op/s
>2015-08-31 16:39:57.267697 mon.0 [INF] pgmap v3495100: 224 pgs: 224 active+clean; 826 GB data, 2700 GB used, 2870 GB / 5571 GB avail; 11357 B/s wr, 5 op/s
>2015-08-31 16:39:58.700312 mon.0 [INF] pgmap v3495101: 224 pgs: 224 active+clean; 826 GB data, 2700 GB used, 2870 GB / 5571 GB avail; 1911 B/s rd, 701 kB/s wr, 19 op/s
>2015-08-31 16:39:59.999624 mon.0 [INF] pgmap v3495102: 224 pgs: 224 active+clean; 826 GB data, 2700 GB used, 2870 GB / 5571 GB avail; 4247 B/s rd, 3092 kB/s wr, 66 op/s
>2015-08-31 16:40:02.156758 mon.0 [INF] pgmap v3495103: 224 pgs: 224 active+clean; 826 GB data, 2700 GB used, 2870 GB / 5571 GB avail; 3292 B/s rd, 2623 kB/s wr, 81 op/s
>2015-08-31 16:40:03.289101 mon.0 [INF] pgmap v3495104: 224 pgs: 224 active+clean; 826 GB data, 2700 GB used, 2870 GB / 5571 GB avail; 65664 B/s rd, 2163 kB/s wr, 76 op/s
>2015-08-31 16:40:04.679926 mon.0 [INF] pgmap v3495105: 224 pgs: 224 active+clean; 826 GB data, 2700 GB used, 2870 GB / 5571 GB avail; 90075 B/s rd, 3158 kB/s wr, 34 op/s
>2015-08-31 16:40:07.237293 mon.0 [INF] pgmap v3495106: 224 pgs: 224 active+clean; 826 GB data, 2700 GB used, 2870 GB / 5571 GB avail; 0 B/s rd, 1899 kB/s wr, 29 op/s
>2015-08-31 16:40:08.303615 mon.0 [INF] pgmap v3495107: 224 pgs: 224 active+clean; 826 GB data, 2700 GB used, 2870 GB / 5571 GB avail; 259 kB/s rd, 2864 kB/s wr, 77 op/s
>2015-08-31 16:40:09.352817 mon.0 [INF] pgmap v3495108: 224 pgs: 224 active+clean; 826 GB data, 2700 GB used, 2870 GB / 5571 GB avail; 411 kB/s rd, 4093 kB/s wr, 115 op/s
>2015-08-31 16:40:11.951104 mon.0 [INF] pgmap v3495109: 224 pgs: 224 active+clean; 826 GB data, 2700 GB used, 2870 GB / 5571 GB avail; 466 kB/s rd, 1863 kB/s wr, 148 op/s
>
>I never seem to get anywhere near 300 op/s. If spindle count is indeed the problem, is there anything else I can do to improve caching or I/O coalescing to deal with my crippling IOPS limit due to the low number of spindles?
>
>Thanks,
>
>--
>Kenneth Van Alstyne
>Systems Architect
>Knight Point Systems, LLC
>Service-Disabled Veteran-Owned Business
>1775 Wiehle Avenue Suite 101 | Reston, VA 20190
>c: 228-547-8045 f: 571-266-3106
>www.knightpoint.com
>DHS EAGLE II Prime Contractor: FC1 SDVOSB Track
>GSA Schedule 70 SDVOSB: GS-35F-0646S
>GSA MOBIS Schedule: GS-10F-0404Y
>ISO 20000 / ISO 27001
>
>Notice: This e-mail message, including any attachments, is for the sole use of the intended recipient(s) and may contain confidential and privileged information. Any unauthorized review, copy, use, disclosure, or distribution is STRICTLY prohibited. If you are not the intended recipient, please contact the sender by reply e-mail and destroy all copies of the original message.
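On the caching and coalescing question above: the librbd cache is configured on the client (hypervisor) side, in the [client] section of ceph.conf. A minimal sketch follows -- the option names are the stock librbd ones, but the values are purely illustrative, not recommendations:

    [client]
        rbd cache = true
        # stay in writethrough mode until the guest issues its first flush (safety default)
        rbd cache writethrough until flush = true
        # per-volume cache size; the librbd default is 32 MB
        rbd cache size = 67108864
        # how much dirty data may accumulate before writeback starts
        rbd cache max dirty = 50331648
        # how long (in seconds) dirty data may sit before being flushed
        rbd cache max dirty age = 2

Note that the cache can only merge writes that are actually adjacent, so it smooths bursts but does not raise the underlying spindle IOPS ceiling discussed in this thread.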
>> On Aug 31, 2015, at 11:01 AM, Christian Balzer <chibi@xxxxxxx> wrote:
>>
>> Hello,
>>
>> On Mon, 31 Aug 2015 08:31:57 -0500 Kenneth Van Alstyne wrote:
>>
>>> Sorry about the repost from the cbt list, but it was suggested I post here as well:
>>>
>> I wasn't even aware a CBT (what the heck does that acronym stand for?) existed...
>>
>>> I am attempting to track down some performance issues in a Ceph cluster recently deployed. Our configuration is as follows: 3 storage nodes,
>> 3 nodes is, of course, the bare minimum.
>>
>>> each with:
>>> - 8 Cores
>> Of what, apples? Detailed information makes for better replies.
>>
>>> - 64GB of RAM
>> Ample.
>>
>>> - 2x 1TB 7200 RPM Spindle
>> Even if your cores were to be rotten apple ones, that's very few spindles, so your CPU is unlikely to be the bottleneck.
>>
>>> - 1x 120GB Intel SSD
>> Details, again. From your P.S. I conclude that these are S3500s, definitely not my choice for journals when it comes to speed and endurance.
>>
>>> - 2x 10GBit NICs (In LACP Port-channel)
>> Massively overspec'ed considering your storage sinks/wells aka HDDs.
>>
>>> The OSD pool min_size is set to "1" and "size" is set to "3". When creating a new pool and running RADOS benchmarks, performance isn't bad -- about what I would expect from this hardware configuration:
>>>
>> Rados bench uses 4MB "blocks" by default, which is the optimum size for (default) RBD pools.
>> Bandwidth does not equal IOPS (which are commonly measured in 4KB blocks).
>>
>>> WRITES:
>>> Total writes made:      207
>>> Write size:             4194304
>>> Bandwidth (MB/sec):     80.017
>>>
>>> Stddev Bandwidth:       34.9212
>>> Max bandwidth (MB/sec): 120
>>> Min bandwidth (MB/sec): 0
>>> Average Latency:        0.797667
>>> Stddev Latency:         0.313188
>>> Max latency:            1.72237
>>> Min latency:            0.253286
>>>
>>> RAND READS:
>>> Total time run:         10.127990
>>> Total reads made:       1263
>>> Read size:              4194304
>>> Bandwidth (MB/sec):     498.816
>>>
>>> Average Latency:        0.127821
>>> Max latency:            0.464181
>>> Min latency:            0.0220425
>>>
>>> This all looks fine, until we try to use the cluster for its purpose, which is to house images for qemu-kvm, which are accessed using librbd.
>> Not that it probably matters, but knowing if this is OpenStack, Ganeti or something else might be of interest.
>>
>>> I/O inside the VMs has excessive I/O wait times (in the hundreds of ms at times, making some operating systems, like Windows, unusable) and throughput struggles to exceed 10MB/s (or less). Looking at ceph health, we see very low op/s numbers as well as low throughput, and the requests-blocked count seems very high. Any ideas as to what to look at here?
>>>
>> Again, details.
>>
>> How many VMs?
>> What are they doing?
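To put an IOPS figure next to the 4MB bandwidth numbers quoted above, rados bench can be rerun with a 4KB block size. A sketch only: "scratch" is a hypothetical test pool name, the thread count is illustrative, and the cleanup step assumes a release that ships the cleanup subcommand:

    # small-block write test; --no-cleanup keeps the objects so the read phase has data
    rados -p scratch bench 30 write -b 4096 -t 16 --no-cleanup
    # random reads against the 4KB objects just written
    rados -p scratch bench 30 rand -t 16
    # remove the benchmark objects afterwards
    rados -p scratch cleanup

The op/s column this reports can then be compared directly against the spindle-based estimates in this thread, rather than inferring IOPS from 4MB bandwidth.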
>> Keep in mind that the BEST sustained result you could hope for here (ignoring Ceph overhead and network latency) is the IOPS of 2 HDDs, so about 300 IOPS at best. TOTAL.
>>
>>>     health HEALTH_WARN
>>>            8 requests are blocked > 32 sec
>>>     monmap e3: 3 mons at {storage-1=10.0.0.1:6789/0,storage-2=10.0.0.2:6789/0,storage-3=10.0.0.3:6789/0}
>>>            election epoch 128, quorum 0,1,2 storage-1,storage-2,storage-3
>>>     osdmap e69615: 6 osds: 6 up, 6 in
>>>      pgmap v3148541: 224 pgs, 1 pools, 819 GB
>> 256 or 512 PGs would have been the "correct" number here, but that's of little importance.
>>
>>> data, 227 kobjects
>>>            2726 GB used, 2844 GB / 5571 GB avail
>>>                 224 active+clean
>>>   client io 3957 B/s rd, 3494 kB/s wr, 30 op/s
>>>
>> That's a lot of data being written for a tiny cluster like yours.
>> Looking at your nodes with atop or similar tools will likely reveal that your HDDs are quite the busy beavers and can't keep up.
>>
>> Also, prolonged values from "ceph -w" might be educational.
>>
>> Regards,
>>
>> Christian
>>
>>> Of note, on the other list, I was asked to provide the following:
>>>  - ceph version 0.94.1 (e4bfad3a3c51054df7e537a724c8d0bf9be972ff)
>>>  - The SSD is split into 8GB partitions. These 8GB partitions are used as journal devices, specified in /etc/ceph/ceph.conf. For example:
>>>        [osd.0]
>>>          host = storage-1
>>>          osd journal = /dev/mapper/INTEL_SSDSC2BB120G4_CVWL4363006R120LGNp1
>>>  - rbd_cache is enabled and qemu cache is set to "writeback"
>>>  - rbd_concurrent_management_ops is unset, so it appears the default is "10"
>>>
>>> Thanks,
>>>
>>> --
>>> Kenneth Van Alstyne
>>
>> --
>> Christian Balzer           Network/Systems Engineer
>> chibi@xxxxxxx              Global OnLine Japan/Fusion Communications
>> http://www.gol.com/

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com