Re: rbd performance issue - can't find bottleneck


On 06/17/2015 09:03 AM, Jacek Jarosiewicz wrote:
On 06/17/2015 03:34 PM, Mark Nelson wrote:
On 06/17/2015 04:10 AM, Jacek Jarosiewicz wrote:
Hi,


[ cut ]


~60MB/s seq writes
~100MB/s seq reads
~2-3k IOPS random reads

Is this per SSD or aggregate?

aggregate (if I understand you correctly). This is what I see when I run
tests on the client - a mapped and mounted rbd.
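
(In case it matters, the client setup is essentially this - a sketch with
hypothetical pool/image names:

  rbd create rbd/test --size 102400     # 100 GB test image
  rbd map rbd/test                      # appears as e.g. /dev/rbd0
  mkfs.xfs /dev/rbd0
  mount /dev/rbd0 /mnt/test
)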



The client is an rbd mounted on a Linux Ubuntu box. All the servers (OSD
nodes and the client) are running Ubuntu Server 14.04. We tried switching
to CentOS 7, but the results were the same.

Is this kernel RBD or a VM using QEMU/KVM?  You might want to try fio
with the librbd engine and see if you get the same results.  Also,
radosbench isn't exactly analogous, but you might try some large
sequential write / sequential read tests just as a sanity check.
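
For example, something like this (a sketch, assuming a pool named "rbd"
and a pre-created test image "fio_test"; adjust names and sizes to your
setup):

  # fio job using the librbd engine - no kernel mapping involved
  [global]
  ioengine=rbd
  clientname=admin
  pool=rbd
  rbdname=fio_test
  runtime=60
  time_based

  [seq-write]
  rw=write
  bs=4M
  iodepth=32

and for radosbench:

  rados bench -p rbd 60 write --no-cleanup   # large (4M) sequential writes
  rados bench -p rbd 60 seq                  # sequential reads of those objects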


This is kernel rbd - testing performance on VMs will be the next step.
I've tried fio with librbd, but the results were similar.
I'll run the radosbench tests and post my results.


Here are some technical details about our setup:

Four identical OSD nodes:
E5-1630 CPU
32 GB RAM
Mellanox MT27520 56Gbps network cards
SATA controller LSI Logic SAS3008

Specs look fine.


Storage nodes are connected to SuperMicro chassis: 847E1C-R1K28JBOD

Is that where the SSDs live?  I'm not a fan of such heavy expander
over-subscription, but if you are getting good results outside of Ceph,
I'm guessing it's something else.


No, the SSDs are connected to the integrated Intel SATA controller
(C610/X99).

The only disks that reside in the SuperMicro chassis are the SATA drives,
and in the last tests I don't use them - the results I gave are from SSDs
only (one SSD serves as the OSD and the journal is on another SSD).


Four monitors (one on each node). We do not use CephFS, so we do not run
ceph-mds.

You'll want to go down to 3 or up to 5.  Even numbers of monitors don't
really help you in any way (a quorum needs a strict majority, so 4
monitors tolerate only one failure, the same as 3, while adding one more
thing that can fail).  I'd suggest 3.
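
For reference, dropping the fourth monitor is quick (a sketch, assuming
the extra mon's id is "node4" and Ubuntu's upstart scripts):

  sudo stop ceph-mon id=node4   # on the node running the fourth monitor
  ceph mon remove node4         # remove it from the monmap

then delete its [mon.node4] entry (and mon host address) from ceph.conf
so it doesn't rejoin.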


OK, will do that, thanks!


You didn't mention the brand/model of SSDs.  Especially for writes this
is important, as Ceph journal writes are O_DSYNC.  Drives that have
proper power loss protection can often ignore ATA_CMD_FLUSH and do these
very quickly, while other drives may need to flush to the flash cells.
Also, keep in mind for writes that if you have journals on the SSDs and
3X replication, you'll be doing 6 writes for every client write (each of
the 3 replicas hits both the journal and the data partition).
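
If you want to see how a drive handles journal-style synchronous writes
outside of Ceph, something like this is a quick check (a sketch, assuming
the SSD is /dev/sdX; it writes to the raw device, so only run it on a
drive you can wipe):

  fio --name=journal-test --filename=/dev/sdX --direct=1 --sync=1 \
      --rw=write --bs=4k --numjobs=1 --iodepth=1 --runtime=60 --time_based

--sync=1 opens the device O_SYNC, a close stand-in for the journal's
O_DSYNC writes; drives without power loss protection tend to drop to a
few hundred IOPS here.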


SSDs are Intel SSDSC2BW240A4

Ah, if I'm not mistaken that's the Intel 530, right? You'll want to see this thread by Stefan Priebe:

https://www.mail-archive.com/ceph-users@xxxxxxxxxxxxxx/msg05667.html

In fact, it was the difference in Intel 520 and Intel 530 performance that triggered many of the investigations various folks have made into SSD flushing behavior on ATA_CMD_FLUSH. The gist of it is that the 520 is very fast but probably not safe. The 530 is safe but not fast. The DC S3700 (and similar drives with supercapacitors) is thought to be both fast and safe (though some drives, like the Crucial M500 and later, misrepresented their power loss protection, so you have to be very careful!).

The rbd pool is set to have min_size 1 and size 2.

For reads and read IOPS on SSDs, you might try disabling in-memory
logging and Ceph authentication.  You might be interested in some
testing we did on a variety of SSDs here:

http://www.spinics.net/lists/ceph-users/msg15733.html
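
The relevant ceph.conf bits look roughly like this (a sketch for test
clusters only, since it disables cephx everywhere; there are more
debug_* subsystems than shown):

  [global]
  auth_cluster_required = none
  auth_service_required = none
  auth_client_required = none
  # 0/0 turns off both the file log level and the in-memory log level
  debug_ms = 0/0
  debug_osd = 0/0
  debug_filestore = 0/0
  debug_journal = 0/0
  debug_auth = 0/0
  debug_lockdep = 0/0
  debug_throttle = 0/0

Restart the daemons after changing these.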


Will read up on that too, thanks!

J



