On 06/17/2015 04:10 AM, Jacek Jarosiewicz wrote:
Hi,

We've been doing some testing of Ceph Hammer (0.94.2), but the performance is very slow and we can't find what's causing the problem.

Initially we started with four nodes and 10 OSDs in total. The drives we used were enterprise SATA drives, and on top of that we used SSDs as flashcache devices for the SATA drives and for storing the OSD journals.

The local tests on each of the four nodes give the results you'd expect:

- ~500 MB/s seq writes and reads from the SSDs
- ~40k iops random reads from the SSDs
- ~200 MB/s seq writes and reads from the SATA drives
- ~600 iops random reads from the SATA drives

..but when we tested this setup from a client we got rather slow results. So we tried to find the bottleneck and tested the network by connecting the client to our nodes via NFS - performance via NFS was as expected (similar results to the local tests, only slightly slower).

We then reconfigured Ceph to not use the SATA drives at all and set the OSDs up on the SSDs alone (we wanted to test whether this was a flashcache problem), but with no success. The results of rbd I/O tests against two OSD nodes set up on SSDs are:

- ~60 MB/s seq writes
- ~100 MB/s seq reads
- ~2-3k iops random reads
Is this per SSD or aggregate?
The client is an RBD image mounted on an Ubuntu Linux box. All the servers (OSD nodes and the client) are running Ubuntu Server 14.04. We tried switching to CentOS 7, but the results are the same.
Is this kernel RBD or a VM using QEMU/KVM? You might want to try fio with the librbd engine and see if you get the same results. Also, radosbench isn't exactly analogous, but you might try some large sequential write / sequential read tests just as a sanity check.
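If it helps, here's roughly what I mean - a minimal sketch, assuming a pool named "rbd", the default admin keyring, and a throwaway 10 GB test image called "fio_test" (those names are just placeholders):

    # create a test image for fio's rbd engine (it won't create one itself)
    rbd create -p rbd --size 10240 fio_test

    # fio driving librbd directly, bypassing the kernel RBD client
    fio --name=rbd-seq-write --ioengine=rbd --clientname=admin --pool=rbd \
        --rbdname=fio_test --rw=write --bs=4M --iodepth=16
    fio --name=rbd-rand-read --ioengine=rbd --clientname=admin --pool=rbd \
        --rbdname=fio_test --rw=randread --bs=4k --iodepth=32

    # rados bench as a rough sanity check (again, not exactly analogous to rbd I/O)
    rados bench -p rbd 60 write -b 4194304 -t 16 --no-cleanup
    rados bench -p rbd 60 seq -t 16

If the librbd numbers look fine but the kernel RBD numbers don't, that narrows things down considerably.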
Here are some technical details about our setup. Four identical OSD nodes:

- E5-1630 CPU
- 32 GB RAM
- Mellanox MT27520 56 Gbps network cards
- SATA controller: LSI Logic SAS3008
Specs look fine.
Storage nodes are connected to SuperMicro chassis: 847E1C-R1K28JBOD
Is that where the SSDs live? I'm not a fan of such heavy expander over-subscription, but if you are getting good results outside of Ceph I'm guessing it's something else.
Four monitors (one on each node). We do not use CephFS so we do not run ceph-mds.
You'll want to go down to 3 or up to 5. Even numbers of monitors don't really help you in any way (and can actually hurt). I'd suggest 3.
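For what it's worth, a minimal sketch of dropping the fourth mon (assuming its id is "node4" - substitute your own, and pull it out of ceph.conf on every host afterwards):

    stop ceph-mon id=node4     # on the node hosting it (Ubuntu 14.04 / upstart)
    ceph mon remove node4      # remove it from the monitor map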
During the tests we were monitoring all the OSD nodes and the client - we haven't seen any problems on any of the hosts. Load was low, there were no CPU waits, no abnormal system interrupts, no I/O problems on the disks - none of the systems seemed to break a sweat, and yet the results are rather disappointing. We're kind of lost; any help will be appreciated.
You didn't mention the brand/model of the SSDs. Especially for writes this is important, as Ceph journal writes are O_DSYNC. Drives with proper power-loss protection can often ignore ATA_CMD_FLUSH and complete these very quickly, while other drives may need to flush all the way to the flash cells. Also, keep in mind for writes that if you have journals on the SSDs and 3x replication, you'll be doing 6 writes for every client write (3 replicas, each hitting the journal and then the data partition).
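A quick and dirty way to see how a given SSD behaves with O_DSYNC writes (the pattern the journal uses) is something like the following - /mnt/ssd/journal-test is just a placeholder path on the SSD's filesystem:

    dd if=/dev/zero of=/mnt/ssd/journal-test bs=4k count=10000 oflag=direct,dsync

Drives that are good journal candidates sustain this at high throughput, while many consumer drives collapse to a few MB/s.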
For reads and read IOPs on SSDs, you might try disabling in-memory logging and ceph authentication. You might be interested in some testing we did on a variety of SSDs here:
http://www.spinics.net/lists/ceph-users/msg15733.html
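In case it's useful, a hypothetical [global] snippet along those lines for a throwaway test cluster (the option names are the standard ones; which debug subsystems you zero out is a judgment call, and the cephx change needs daemons and clients restarted):

    [global]
        # disable cephx authentication (test clusters only)
        auth cluster required = none
        auth service required = none
        auth client required = none

        # the value after the slash is the in-memory log level
        debug ms        = 0/0
        debug osd       = 0/0
        debug filestore = 0/0
        debug journal   = 0/0
        debug auth      = 0/0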
Cheers, J