I would say you are probably simply IO starved because you're running too many VMs.
To follow on from Warren's response, if you spread those 160 available IOPS across 15 VMs, you are talking about roughly 10 IOPS per VM, assuming they have similar workloads. That's almost certainly too little. I would expect normal system respiration to consume that without even trying to do any real work.
The way I like to think of it is "fractions of a spindle", since that is the most meaningful framing for the people I'm usually talking to; it illustrates the resources in a more tangible way. You have 6 drives available for VM operations. That immediately gets cut down by a factor of three because of the replicas, so you have two available spindles. So, with 15 VMs, you have about 1/7 of a disk's worth of "attention" that can be paid to each VM. That's not nearly enough.
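Spelled out with Warren's numbers from below:
    6 spindles * ~80 IOPS each = ~480 raw IOPS
    480 / 3 replicas = ~160 client IOPS (about two "effective" spindles)
    160 / 15 VMs = ~10 IOPS per VM (roughly 1/7 of a spindle each)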
I run a setup that sounds a lot like a bigger version of what you are doing, and I've found as a rule of thumb that I need at least 1/3 of a disk per VM to get decent performance. I've created a reduced-redundancy pool to store unimportant VMs on; since those machines keep only two replicas instead of three, they generate less I/O, which has freed up some I/O for "real" work. But aside from that, you have to either reduce VMs or increase spindles...
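For reference, a reduced-redundancy pool like that can be created along these lines (the pool name and PG count here are placeholders, not the ones I actually use):
    ceph osd pool create scratch-vms 128 128
    ceph osd pool set scratch-vms size 2
    ceph osd pool set scratch-vms min_size 1
Just keep in mind that size 2 with min_size 1 leaves those VMs exposed to a single failure, which is why I only put unimportant ones there.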
QH
On Mon, Aug 31, 2015 at 3:39 PM, Wang, Warren <Warren_Wang@xxxxxxxxxxxxxxxxx> wrote:
Hey Kenneth, it looks like you're just down the tollroad from me. I'm in
Reston Town Center.
Just as a really rough estimate, I'd say this is your max IOPS:
80 IOPS/spinner * 6 drives / 3 replicas = 160ish max sustained IOPS
It's more complicated than that, since you have a reasonable solid state
journal, lots of memory, etc, but that¹s a guess, since the backend will
eventually need to keep up. That being said, almost every time I have seen
blocked requests, there is some other underlying issue. I would say start
with implementation checks (example commands below):
- checking connectivity between OSDs, with and without LACP (overkill for
your purposes)
- ensuring that the OSDs' target drives are actually mounted instead of
scribbling to the root drive
- ensuring that the journal is properly implemented
- all OSDs on the same version
- Any OSDs crashing?
- packet fragmentation? We have to stick with 1500 MTU to prevent frags.
Don't assume you can run jumbo
- You¹re not running much traffic, so a short capture on both sides and
wireshark should reveal any obvious issues
Is there anything in the ceph.log from a mon host? Grep for WRN. Also look
at the individual OSD log. This seems more like an implementation issue.
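Here are a few example commands for those checks -- the device names, log
paths, and hostnames below are just typical defaults, so adjust them for
your boxes:
  # OSD data dirs should be real mounts, not directories on the root disk
  df -h /var/lib/ceph/osd/ceph-*
  # confirm each OSD is using the intended journal device (via the admin socket)
  ceph daemon osd.0 config get osd_journal
  # all OSDs on the same version
  ceph tell osd.* version
  # any down or flapping OSDs
  ceph osd tree
  # 1472-byte payload + 28 bytes of headers = 1500; -M do forbids fragmentation
  ping -M do -s 1472 <other-osd-host>
  # cluster log warnings and slow requests, from a mon host
  grep WRN /var/log/ceph/ceph.log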
Happy to help out a local if you need more.
--
Warren Wang
Comcast Cloud (OpenStack)
On 8/31/15, 1:28 PM, "ceph-users on behalf of Kenneth Van Alstyne"
<ceph-users-bounces@xxxxxxxxxxxxxx on behalf of
kvanalstyne@xxxxxxxxxxxxxxx> wrote:
>Christian, et al:
>
>Sorry for the lack of information. I wasn't sure which of our hardware
>specifications or Ceph configuration details would be useful information
>at this point. Thanks for the feedback -- any feedback is appreciated at
>this point, as I've been beating my head against a wall trying to figure
>out what's going on. (If anything. Maybe the spindle count is indeed our
>upper limit, or our SSDs really suck? :-) )
>
>To directly address your questions, see answers below:
> - CBT is the Ceph Benchmarking Tool. Since my question was more general
>than specific to CBT itself, it was probably more useful to post to the
>ceph-users list rather than cbt.
> - 8 Cores are from 2x quad core Intel(R) Xeon(R) CPU E5-2609 0 @ 2.40GHz
> - The SSDs are indeed Intel S3500s. I agree -- not ideal, but supposedly
>capable of up to 75,000 random 4KB reads/writes. Throughput and
>longevity are quite low for an SSD, though, rated at about 400MB/s reads
>and 100MB/s writes. When we added these as journals in front of the
>SATA spindles, both VM performance and rados benchmark numbers were
>relatively unchanged.
>
> - Regarding throughput vs. IOPS, indeed -- the throughput that I'm seeing
>is nearly a worst-case scenario, with all I/O being 4KB block size. With
>RBD cache enabled and the writeback option set in the VM configuration, I
>was hoping more coalescing would occur, increasing the effective I/O block size.
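>
>For reference, the relevant client-side cache settings live in ceph.conf
>and look roughly like this -- the values shown are just the stock defaults
>as I understand them, not tuned numbers from our config:
>
>    [client]
>        rbd cache = true
>        rbd cache size = 33554432                 # 32MB per-client cache
>        rbd cache max dirty = 25165824            # 24MB dirty before writeback
>        rbd cache writethrough until flush = true # writethrough until the guest flushes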
>
>As an aside, the orchestration layer on top of KVM is OpenNebula if
>that's of any interest.
>
>VM information:
> - Number = 15
> - Workload = Mixed (I know, I know -- that's as vague an answer as they
>come.) A handful of VMs are running some MySQL databases and some web
>applications in Apache Tomcat. One is running a syslog server.
>Everything else is mostly static web page serving for a low number of
>users.
>
>I can duplicate the blocked request issue pretty consistently, just by
>running something simple like a "yum -y update" in one VM. While that is
>running, ceph -w and ceph -s show the following:
>root@dashboard:~# ceph -s
> cluster f79d8c2a-3c14-49be-942d-83fc5f193a25
> health HEALTH_WARN
> 1 requests are blocked > 32 sec
> monmap e3: 3 mons at
>{storage-1=10.0.0.1:6789/0,storage-2=10.0.0.2:6789/0,storage-3=10.0.0.3:6789/0}
> election epoch 136, quorum 0,1,2 storage-1,storage-2,storage-3
> osdmap e75590: 6 osds: 6 up, 6 in
> pgmap v3495103: 224 pgs, 1 pools, 826 GB data, 225 kobjects
> 2700 GB used, 2870 GB / 5571 GB avail
> 224 active+clean
> client io 3292 B/s rd, 2623 kB/s wr, 81 op/s
>
>2015-08-31 16:39:46.490696 mon.0 [INF] pgmap v3495096: 224 pgs: 224
>active+clean; 826 GB data, 2700 GB used, 2870 GB / 5571 GB avail
>2015-08-31 16:39:47.789982 mon.0 [INF] pgmap v3495097: 224 pgs: 224
>active+clean; 826 GB data, 2700 GB used, 2870 GB / 5571 GB avail; 0 B/s
>rd, 517 kB/s wr, 130 op/s
>2015-08-31 16:39:49.239033 mon.0 [INF] pgmap v3495098: 224 pgs: 224
>active+clean; 826 GB data, 2700 GB used, 2870 GB / 5571 GB avail; 0 B/s
>rd, 474 kB/s wr, 128 op/s
>2015-08-31 16:39:51.970679 mon.0 [INF] pgmap v3495099: 224 pgs: 224
>active+clean; 826 GB data, 2700 GB used, 2870 GB / 5571 GB avail; 0 B/s
>rd, 58662 B/s wr, 22 op/s
>2015-08-31 16:39:57.267697 mon.0 [INF] pgmap v3495100: 224 pgs: 224
>active+clean; 826 GB data, 2700 GB used, 2870 GB / 5571 GB avail; 11357
>B/s wr, 5 op/s
>2015-08-31 16:39:58.700312 mon.0 [INF] pgmap v3495101: 224 pgs: 224
>active+clean; 826 GB data, 2700 GB used, 2870 GB / 5571 GB avail; 1911
>B/s rd, 701 kB/s wr, 19 op/s
>2015-08-31 16:39:59.999624 mon.0 [INF] pgmap v3495102: 224 pgs: 224
>active+clean; 826 GB data, 2700 GB used, 2870 GB / 5571 GB avail; 4247
>B/s rd, 3092 kB/s wr, 66 op/s
>2015-08-31 16:40:02.156758 mon.0 [INF] pgmap v3495103: 224 pgs: 224
>active+clean; 826 GB data, 2700 GB used, 2870 GB / 5571 GB avail; 3292
>B/s rd, 2623 kB/s wr, 81 op/s
>2015-08-31 16:40:03.289101 mon.0 [INF] pgmap v3495104: 224 pgs: 224
>active+clean; 826 GB data, 2700 GB used, 2870 GB / 5571 GB avail; 65664
>B/s rd, 2163 kB/s wr, 76 op/s
>2015-08-31 16:40:04.679926 mon.0 [INF] pgmap v3495105: 224 pgs: 224
>active+clean; 826 GB data, 2700 GB used, 2870 GB / 5571 GB avail; 90075
>B/s rd, 3158 kB/s wr, 34 op/s
>2015-08-31 16:40:07.237293 mon.0 [INF] pgmap v3495106: 224 pgs: 224
>active+clean; 826 GB data, 2700 GB used, 2870 GB / 5571 GB avail; 0 B/s
>rd, 1899 kB/s wr, 29 op/s
>2015-08-31 16:40:08.303615 mon.0 [INF] pgmap v3495107: 224 pgs: 224
>active+clean; 826 GB data, 2700 GB used, 2870 GB / 5571 GB avail; 259
>kB/s rd, 2864 kB/s wr, 77 op/s
>2015-08-31 16:40:09.352817 mon.0 [INF] pgmap v3495108: 224 pgs: 224
>active+clean; 826 GB data, 2700 GB used, 2870 GB / 5571 GB avail; 411
>kB/s rd, 4093 kB/s wr, 115 op/s
>2015-08-31 16:40:11.951104 mon.0 [INF] pgmap v3495109: 224 pgs: 224
>active+clean; 826 GB data, 2700 GB used, 2870 GB / 5571 GB avail; 466
>kB/s rd, 1863 kB/s wr, 148 op/s
>
>I never seem to get anywhere near 300 op/s. If spindle count is indeed
>the problem, is there anything else I can do to improve caching or I/O
>coalescing to deal with my crippling IOPS limit due to the low number of
>spindles?
>
>Thanks,
>
>--
>Kenneth Van Alstyne
>Systems Architect
>Knight Point Systems, LLC
>Service-Disabled Veteran-Owned Business
>1775 Wiehle Avenue Suite 101 | Reston, VA 20190
>c: 228-547-8045 f: 571-266-3106
>www.knightpoint.com
>DHS EAGLE II Prime Contractor: FC1 SDVOSB Track
>GSA Schedule 70 SDVOSB: GS-35F-0646S
>GSA MOBIS Schedule: GS-10F-0404Y
>ISO 20000 / ISO 27001
>
>Notice: This e-mail message, including any attachments, is for the sole
>use of the intended recipient(s) and may contain confidential and
>privileged information. Any unauthorized review, copy, use, disclosure,
>or distribution is STRICTLY prohibited. If you are not the intended
>recipient, please contact the sender by reply e-mail and destroy all
>copies of the original message.
>
>> On Aug 31, 2015, at 11:01 AM, Christian Balzer <chibi@xxxxxxx> wrote:
>>
>>
>> Hello,
>>
>> On Mon, 31 Aug 2015 08:31:57 -0500 Kenneth Van Alstyne wrote:
>>
>>> Sorry about the repost from the cbt list, but it was suggested I post
>>> here as well:
>>>
>> I wasn't even aware a CBT (what the heck does that acronym stand for?)
>> existed...
>>
>>> I am attempting to track down some performance issues in a recently
>>> deployed Ceph cluster. Our configuration is as follows: 3 storage nodes,
>> 3 nodes is, of course, bare minimum.
>>
>>> each with:
>>> - 8 Cores
>> Of what, apples? Detailed information makes for better replies.
>>
>>> - 64GB of RAM
>> Ample.
>>
>>> - 2x 1TB 7200 RPM Spindle
>> Even if your cores were to be rotten apple ones, that's very few
>> spindles, so your CPU is unlikely to be the bottleneck.
>>
>>> - 1x 120GB Intel SSD
>> Details, again. From your P.S. I conclude that these are S3500's,
>> definitely not my choice for journals when it comes to speed and
>>endurance.
>>
>>> - 2x 10GBit NICs (In LACP Port-channel)
>> Massively overspec'ed considering your storage sinks/wells aka HDDs.
>>
>>>
>>> The OSD pool min_size is set to "1" and "size" is set to "3". When
>>> creating a new pool and running RADOS benchmarks, performance isn't bad
>>> -- about what I would expect from this hardware configuration:
>>>
>> Rados bench uses 4MB "blocks" by default, which is the optimum size for
>> (default) RBD pools.
>> Bandwidth does not equal IOPS (which are commonly measured in 4KB
>>blocks).
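>>
>> If you want bench numbers closer to what your VMs actually generate, rados
>> bench can also be run with a 4KB block size, for example (pool name and
>> run times here are just examples):
>>
>>     rados bench -p <pool> 30 write -b 4096 -t 16 --no-cleanup
>>     rados bench -p <pool> 30 rand -t 16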
>>
>>> WRITES:
>>> Total writes made: 207
>>> Write size: 4194304
>>> Bandwidth (MB/sec): 80.017
>>>
>>> Stddev Bandwidth: 34.9212
>>> Max bandwidth (MB/sec): 120
>>> Min bandwidth (MB/sec): 0
>>> Average Latency: 0.797667
>>> Stddev Latency: 0.313188
>>> Max latency: 1.72237
>>> Min latency: 0.253286
>>>
>>> RAND READS:
>>> Total time run: 10.127990
>>> Total reads made: 1263
>>> Read size: 4194304
>>> Bandwidth (MB/sec): 498.816
>>>
>>> Average Latency: 0.127821
>>> Max latency: 0.464181
>>> Min latency: 0.0220425
>>>
>>> This all looks fine, until we try to use the cluster for its purpose,
>>> which is to house images for qemu-kvm, which are accessed using librbd.
>> Not that it probably matters, but knowing whether this is OpenStack,
>> Ganeti, or something else might be of interest.
>>
>>> I/O inside the VMs sees excessive wait times (in the hundreds of ms at
>>> times, making some operating systems, like Windows, unusable), and
>>> throughput struggles to exceed 10MB/s (often less). Looking at ceph
>>> health, we see very low op/s and throughput numbers, and the count of
>>> blocked requests seems very high. Any ideas as to what to look
>>> at here?
>>>
>> Again, details.
>>
>> How many VMs?
>> What are they doing?
>> Keep in mind that the BEST sustained result you could hope for here
>> (ignoring Ceph overhead and network latency) is the IOPS of 2 HDDs, so
>> about 300 IOPS at best. TOTAL.
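>>
>> A quick way to see how much of that budget one guest actually gets is a
>> short 4KB random-write run inside a VM, for example with fio (fio isn't
>> part of your setup as described, just a convenient tool; the file path is
>> arbitrary):
>>
>>     fio --name=rand4k --filename=/root/fio-test --rw=randwrite --bs=4k \
>>         --size=1G --iodepth=16 --direct=1 --runtime=30 --time_based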
>>
>>> health HEALTH_WARN
>>> 8 requests are blocked > 32 sec
>>> monmap e3: 3 mons at
>>>
>>>{storage-1=10.0.0.1:6789/0,storage-2=10.0.0.2:6789/0,storage-3=10.0.0.3:6789/0}
>>> election epoch 128, quorum 0,1,2 storage-1,storage-2,storage-3
>>> osdmap e69615: 6 osds: 6 up, 6 in
>>> pgmap v3148541: 224 pgs, 1 pools, 819 GB
>> 256 or 512 PGs would have been the "correct" number here, but that's of
>> little importance.
>>
>>> data, 227 kobjects
>>> 2726 GB used, 2844 GB / 5571 GB avail
>>> 224 active+clean
>>> client io 3957 B/s rd, 3494 kB/s wr, 30 op/s
>>>
>> That's a lot of data being written for a tiny cluster like yours.
>> Looking at your nodes with atop or similar tools will likely reveal that
>> your HDDs are quite the busy beavers and can't keep up.
>>
>> Also prolonged values from "ceph -w" might be educational.
>>
>> Regards,
>>
>> Christian
>>
>>> Of note, on the other list, I was asked to provide the following:
>>> - ceph version 0.94.1 (e4bfad3a3c51054df7e537a724c8d0bf9be972ff)
>>> - The SSD is split into 8GB partitions. These 8GB partitions are
>>> used as journal devices, specified in /etc/ceph/ceph.conf. For
>>>example:
>>>     [osd.0]
>>>         host = storage-1
>>>         osd journal = /dev/mapper/INTEL_SSDSC2BB120G4_CVWL4363006R120LGNp1
>>> - rbd_cache is enabled and qemu cache is set to "writeback"
>>> - rbd_concurrent_management_ops is unset, so it appears the
>>> default is "10"
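>>>
>>> Since the journals are 8GB partitions on the S3500, a quick sanity check
>>> of the SSD's synchronous small-write speed (which is roughly what the
>>> journal workload looks like) would be something along these lines. Run it
>>> against a scratch file or an unused partition on that SSD, never a live
>>> journal; fio is just a convenient tool here, not part of our stack, and
>>> the file path is a placeholder:
>>>
>>>     fio --name=journal-check --filename=/path/on/the/ssd/testfile \
>>>         --rw=write --bs=4k --size=1G --iodepth=1 --numjobs=1 \
>>>         --direct=1 --sync=1 --runtime=30 --time_based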
>>>
>>> Thanks,
>>>
>>> --
>>> Kenneth Van Alstyne
>>> Systems Architect
>>> Knight Point Systems, LLC
>>> Service-Disabled Veteran-Owned Business
>>> 1775 Wiehle Avenue Suite 101 | Reston, VA 20190
>>> c: 228-547-8045 f: 571-266-3106
>>> www.knightpoint.com
>>> DHS EAGLE II Prime Contractor: FC1 SDVOSB Track
>>> GSA Schedule 70 SDVOSB: GS-35F-0646S
>>> GSA MOBIS Schedule: GS-10F-0404Y
>>> ISO 20000 / ISO 27001
>>>
>>>
>>
>>
>> --
>> Christian Balzer Network/Systems Engineer
>> chibi@xxxxxxx Global OnLine Japan/Fusion Communications
>> http://www.gol.com/
>
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com