Re: Ceph Performance Questions with rbd images access by qemu-kvm

Hello,

On Mon, 31 Aug 2015 12:28:15 -0500 Kenneth Van Alstyne wrote:

In addition to the spot-on comments by Warren and Quentin, verify this by
watching your nodes with atop, iostat, etc.
The culprit (the HDDs) should be plainly visible.
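
For example (just a sketch, intervals and device names are whatever suits
you), on each storage node:

  iostat -xm 5   # extended per-device stats, watch %util and await
  atop 5         # the DSK lines show per-disk busy percentage

If the spindles behind the OSDs sit near 100% busy while the journal SSD
is mostly idle, there's your answer.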

More inline:

> Christian, et al:
> 
> Sorry for the lack of information.  I wasn’t sure what of our hardware
> specifications or Ceph configuration was useful information at this
> point.  Thanks for the feedback — any feedback is appreciated at this
> point, as I’ve been beating my head against a wall trying to figure out
> what’s going on.  (If anything.  Maybe the spindle count is indeed our
> upper limit or our SSDs really suck? :-) )
>
Your SSDs aren't the problem.
 
> To directly address your questions, see answers below:
> 	- CBT is the Ceph Benchmarking Tool.  Since my question was more
> generic rather than about CBT itself, it was probably more useful to post
> to the ceph-users list rather than cbt.
> 	- 8 Cores are from 2x quad core Intel(R) Xeon(R) CPU E5-2609 0 @
> 2.40GHz
Not your problem either.

> 	- The SSDs are indeed Intel S3500s.  I agree — not ideal, but
> supposedly capable of up to 75,000 random 4KB reads/writes.  Throughput
> and longevity are quite low for an SSD, rated at about 400MB/s reads and
> 100MB/s writes, though.  When we added these as journals in front of the
> SATA spindles, both VM performance and rados benchmark numbers were
> relatively unchanged.
>
The only thing relevant with regard to journal SSDs is the sequential
(SYNC) write speed; they don't seek and normally don't get read either.
This is why a 200GB DC S3700 is a better journal SSD than the 200GB S3710,
which is faster in every other respect except sequential writes. ^o^

Latency should have gone down with the SSD journals in place, though;
that is their main function/benefit.
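
If you want to see what the S3500s actually sustain as journals, the usual
check is a synchronous sequential write test with fio (a sketch only;
/dev/sdX is a placeholder and the test overwrites whatever lives there, so
point it at an unused partition):

  fio --filename=/dev/sdX --direct=1 --sync=1 --rw=write --bs=4k \
      --numjobs=1 --iodepth=1 --runtime=60 --time_based \
      --group_reporting --name=journal-sync-test

The sustained rate reported there is roughly what caps journal throughput
per OSD.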
 
> 	- Regarding throughput vs iops, indeed — the throughput that I’m
> seeing is nearly worst case scenario, with all I/O being 4KB block
> size.  With RBD cache enabled and the writeback option set in the VM
> configuration, I was hoping more coalescing would occur, increasing the
> I/O block size.
> 
That can only help with non-SYNC writes, so your MySQL VMs and certain
file system ops have to bypass it, and that hurts.
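
For reference, and purely as an illustration (the values shown are just
the Hammer-era defaults, not a tuning recommendation), the librbd cache
knobs live in the [client] section of ceph.conf and the qemu side is set
per disk in the libvirt domain XML:

  [client]
      rbd cache = true
      # 32MB cache and 24MB dirty limit per image are the defaults
      rbd cache size = 33554432
      rbd cache max dirty = 25165824
      # behaves as writethrough until the guest issues its first flush
      rbd cache writethrough until flush = true

  <driver name='qemu' type='raw' cache='writeback'/>

None of that changes the fact that flush-heavy guests (databases,
journaling file systems) still end up waiting on the OSDs.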

> As an aside, the orchestration layer on top of KVM is OpenNebula if
> that’s of any interest.
> 
It is, actually, as I've been eyeing OpenNebula (alas, no Debian Jessie
packages). It is indeed not relevant to your problem, though.

> VM information:
> 	- Number = 15
> 	- Workload = Mixed (I know, I know — that’s as vague an answer
> as they come)  A handful of VMs are running some MySQL databases and
> some web applications in Apache Tomcat.  One is running a syslog
> server.  Everything else is mostly static web page serving for a low
> number of users.
> 
As others have mentioned, would you expect this load to work well served
by just 2 HDDs, over NFS with the network latency that introduces?

> I can duplicate the blocked request issue pretty consistently, just by
> running something simple like a “yum -y update” in one VM.  While that
> is running, ceph -w and ceph -s show the following:
> 
> root@dashboard:~# ceph -s
>     cluster f79d8c2a-3c14-49be-942d-83fc5f193a25
>      health HEALTH_WARN
>             1 requests are blocked > 32 sec
>      monmap e3: 3 mons at {storage-1=10.0.0.1:6789/0,storage-2=10.0.0.2:6789/0,storage-3=10.0.0.3:6789/0}
>             election epoch 136, quorum 0,1,2 storage-1,storage-2,storage-3
>      osdmap e75590: 6 osds: 6 up, 6 in
>       pgmap v3495103: 224 pgs, 1 pools, 826 GB data, 225 kobjects
>             2700 GB used, 2870 GB / 5571 GB avail
>                  224 active+clean
>   client io 3292 B/s rd, 2623 kB/s wr, 81 op/s
> 
[snip]
> 466 kB/s rd, 1863 kB/s wr, 148 op/s
> 
This is a good sample. Unless your reads can be satisfied from the page
cache on your storage nodes or inside your VMs (more memory for the VMs
may help here), they are competing (seeking) with your write requests. So
yeah, this is probably as good as it gets.
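
To see where those blocked requests actually spend their time (the OSD id
below is a placeholder; run the daemon command on the node hosting that
OSD):

  ceph health detail                    # names the OSDs with slow requests
  ceph daemon osd.0 dump_historic_ops   # slowest recent ops with per-event timestamps

Long gaps between the events of an op will tell you whether it's the disks
or something else holding things up.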

> I never seem to get anywhere near 300 op/s.  If spindle count is indeed
> the problem, is there anything else I can do to improve caching or I/O
> coalescing to deal with my crippling IOPS limit due to the low number of
> spindles?
> 
Other than replacing spindles with SSDs, not really. 
Your client workload is too mixed for anything else to help, short of that
or massively more spindles.

On the other hand, I have a cluster with very few OSDs (4!), hundreds of
VMs and typical activity like this: 11750 kB/s wr, 1426 op/s.
Note the lack of reads: all these VMs run the same OS/application and are
basically write-only.
Add to that that the OSDs are actually RAIDs behind a 4GB controller cache,
so the "disks" aren't busy at all.
However, reads, as when rebooting VMs, impact this cluster quite a bit.

Christian
> Thanks,
> 
> --
> Kenneth Van Alstyne
> Systems Architect
> Knight Point Systems, LLC
> Service-Disabled Veteran-Owned Business
> 1775 Wiehle Avenue Suite 101 | Reston, VA 20190
> c: 228-547-8045 f: 571-266-3106
> www.knightpoint.com 
> DHS EAGLE II Prime Contractor: FC1 SDVOSB Track
> GSA Schedule 70 SDVOSB: GS-35F-0646S
> GSA MOBIS Schedule: GS-10F-0404Y
> ISO 20000 / ISO 27001
> 
> Notice: This e-mail message, including any attachments, is for the sole
> use of the intended recipient(s) and may contain confidential and
> privileged information. Any unauthorized review, copy, use, disclosure,
> or distribution is STRICTLY prohibited. If you are not the intended
> recipient, please contact the sender by reply e-mail and destroy all
> copies of the original message.
> 
> > On Aug 31, 2015, at 11:01 AM, Christian Balzer <chibi@xxxxxxx> wrote:
> > 
> > 
> > Hello,
> > 
> > On Mon, 31 Aug 2015 08:31:57 -0500 Kenneth Van Alstyne wrote:
> > 
> >> Sorry about the repost from the cbt list, but it was suggested I post
> >> here as well:
> >> 
> > I wasn't even aware a CBT (what the heck does that acronym stand for?)
> > existed...
> > 
> >> I am attempting to track down some performance issues in a Ceph
> >> cluster recently deployed.  Our configuration is as follows: 3
> >> storage nodes,
> > 3 nodes is, of course, bare minimum. 
> > 
> >> each with:
> >> 		- 8 Cores
> > Of what, apples? Detailed information makes for better replies.
> > 
> >> 		- 64GB of RAM
> > Ample.
> > 
> >> 		- 2x 1TB 7200 RPM Spindle
> > Even if your cores were to be rotten apple ones, that's very few
> > spindles, so your CPU is unlikely to be the bottleneck.
> > 
> >> 		- 1x 120GB Intel SSD
> > Details, again. From your P.S. I conclude that these are S3500's,
> > definitely not my choice for journals when it comes to speed and
> > endurance.
> > 
> >> 		- 2x 10GBit NICs (In LACP Port-channel)
> > Massively overspec'ed considering your storage sinks/wells aka HDDs.
> > 
> >> 
> >> The OSD pool min_size is set to “1” and “size” is set to “3”.  When
> >> creating a new pool and running RADOS benchmarks, performance isn’t
> >> bad — about what I would expect from this hardware configuration:
> >> 
> > Rados bench uses 4MB "blocks" by default, which is the optimum size for
> > (default) RBD pools.
> > Bandwidth does not equal IOPS (which are commonly measured in 4KB
> > blocks).
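> > 
> > For numbers closer to what your VMs experience, rados bench can also be
> > run with small blocks (the pool name is a placeholder):
> > 
> >   rados bench -p <pool> 60 write -b 4096 -t 16 --no-cleanup
> > 
> > The op/s reported by that will be far more telling than the 4MB figures
> > below.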
> > 
> >> WRITES:
> >> Total writes made:      207
> >> Write size:             4194304
> >> Bandwidth (MB/sec):     80.017 
> >> 
> >> Stddev Bandwidth:       34.9212
> >> Max bandwidth (MB/sec): 120
> >> Min bandwidth (MB/sec): 0
> >> Average Latency:        0.797667
> >> Stddev Latency:         0.313188
> >> Max latency:            1.72237
> >> Min latency:            0.253286
> >> 
> >> RAND READS:
> >> Total time run:        10.127990
> >> Total reads made:     1263
> >> Read size:            4194304
> >> Bandwidth (MB/sec):    498.816 
> >> 
> >> Average Latency:       0.127821
> >> Max latency:           0.464181
> >> Min latency:           0.0220425
> >> 
> >> This all looks fine, until we try to use the cluster for its purpose,
> >> which is to house images for qemu-kvm, which are accessed using librbd.
> > Not that it probably matters, but knowing if this is OpenStack, Ganeti or
> > something else might be of interest.
> > 
> >> I/O inside VMs has excessive wait times (in the hundreds of ms at
> >> times, making some operating systems, like Windows, unusable) and
> >> throughput struggles to exceed 10MB/s (often less).  Looking at ceph
> >> health, we see very low op/s numbers as well as low throughput, and the
> >> blocked requests number seems very high.  Any ideas as to what to look
> >> at here?
> >> 
> > Again, details.
> > 
> > How many VMs? 
> > What are they doing? 
> > Keep in mind that the BEST sustained result you could hope for here
> > (ignoring Ceph overhead and network latency) is the IOPS of 2 HDDs,
> > since 6 OSDs with 3x replication leave you 2 spindles' worth of unique
> > writes at roughly 150 IOPS each, so about 300 IOPS at best. TOTAL.
> > 
> >>     health HEALTH_WARN
> >>            8 requests are blocked > 32 sec
> >>     monmap e3: 3 mons at {storage-1=10.0.0.1:6789/0,storage-2=10.0.0.2:6789/0,storage-3=10.0.0.3:6789/0}
> >>            election epoch 128, quorum 0,1,2 storage-1,storage-2,storage-3
> >>     osdmap e69615: 6 osds: 6 up, 6 in
> >>      pgmap v3148541: 224 pgs, 1 pools, 819 GB data, 227 kobjects
> > 256 or 512 PGs would have been the "correct" number here, but that's of
> > little importance. 
> > 
> >>            2726 GB used, 2844 GB / 5571 GB avail
> >>                 224 active+clean
> >>  client io 3957 B/s rd, 3494 kB/s wr, 30 op/s
> >> 
> > That's a lot of data being written for a tiny cluster like yours.
> > Looking at your nodes with atop or similar tools will likely reveal
> > that your HDDs are quite the busy beavers and can't keep up.
> > 
> > Also, watching "ceph -w" output over a prolonged period might be educational.
> > 
> > Regards,
> > 
> > Christian
> > 
> >> Of note, on the other list, I was asked to provide the following:
> >> 	- ceph version 0.94.1 (e4bfad3a3c51054df7e537a724c8d0bf9be972ff)
> >> 	- The SSD is split into 8GB partitions. These 8GB partitions
> >> are used as journal devices, specified in /etc/ceph/ceph.conf.  For
> >> example:
> >> 	[osd.0]
> >> 		host = storage-1
> >> 		osd journal = /dev/mapper/INTEL_SSDSC2BB120G4_CVWL4363006R120LGNp1
> >> 	- rbd_cache is enabled and qemu cache is set to “writeback”
> >> 	- rbd_concurrent_management_ops is unset, so it appears the
> >> default is “10”
> >> 
> >> Thanks,
> >> 
> >> --
> >> Kenneth Van Alstyne
> >> Systems Architect
> >> Knight Point Systems, LLC
> >> Service-Disabled Veteran-Owned Business
> >> 1775 Wiehle Avenue Suite 101 | Reston, VA 20190
> >> c: 228-547-8045 f: 571-266-3106
> >> www.knightpoint.com 
> >> DHS EAGLE II Prime Contractor: FC1 SDVOSB Track
> >> GSA Schedule 70 SDVOSB: GS-35F-0646S
> >> GSA MOBIS Schedule: GS-10F-0404Y
> >> ISO 20000 / ISO 27001
> >> 
> >> Notice: This e-mail message, including any attachments, is for the
> >> sole use of the intended recipient(s) and may contain confidential and
> >> privileged information. Any unauthorized review, copy, use,
> >> disclosure, or distribution is STRICTLY prohibited. If you are not
> >> the intended recipient, please contact the sender by reply e-mail and
> >> destroy all copies of the original message.
> >> 
> > 
> > 
> > -- 
> > Christian Balzer        Network/Systems Engineer                
> > chibi@xxxxxxx   	Global OnLine Japan/Fusion Communications
> > http://www.gol.com/
> 
> 


-- 
Christian Balzer        Network/Systems Engineer                
chibi@xxxxxxx   	Global OnLine Japan/Fusion Communications
http://www.gol.com/
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



