On 06/17/2015 04:10 AM, Jacek Jarosiewicz wrote:
Hi,

We've been doing some testing of Ceph Hammer (0.94.2), but the performance is very slow and we can't find what's causing the problem.

Initially we started with four nodes and 10 OSDs in total. The drives we used were enterprise SATA drives, and on top of that we used SSDs as flashcache devices for the SATA drives and for storing the OSD journals.

The local tests on each of the four nodes give the results you'd expect:

- ~500 MB/s seq writes and reads from the SSDs
- ~40k iops random reads from the SSDs
- ~200 MB/s seq writes and reads from the SATA drives
- ~600 iops random reads from the SATA drives

..but when we tested this setup from a client we got rather slow results. So we tried to find the bottleneck and tested the network by connecting the client to our nodes via NFS - performance via NFS was as expected (similar results to the local tests, only slightly slower).

We then reconfigured Ceph to not use the SATA drives at all and set the OSDs up on the SSDs alone (we wanted to test whether this was a flashcache problem), but with no success. The results of rbd I/O tests against two OSD nodes set up on SSDs are:

- ~60 MB/s seq writes
- ~100 MB/s seq reads
- ~2-3k iops random reads
Is this per SSD or aggregate?
The client is an RBD image mounted on an Ubuntu Linux box. All the servers (OSD nodes and the client) are running Ubuntu Server 14.04. We tried switching to CentOS 7, but the results are the same.
Is this kernel RBD or a VM using QEMU/KVM? You might want to try fio with the librbd engine and see if you get the same results. Also, radosbench isn't exactly analogous, but you might try some large sequential write / sequential read tests just as a sanity check.
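If it helps, here's roughly what I mean - a minimal sketch, assuming a pool named "rbd", the default admin keyring, and a throwaway 10 GB test image called "fio_test" (those names are just placeholders):

    # create a test image for fio's rbd engine (it won't create one itself)
    rbd create -p rbd --size 10240 fio_test

    # fio driving librbd directly, bypassing the kernel RBD client
    fio --name=rbd-seq-write --ioengine=rbd --clientname=admin --pool=rbd \
        --rbdname=fio_test --rw=write --bs=4M --iodepth=16
    fio --name=rbd-rand-read --ioengine=rbd --clientname=admin --pool=rbd \
        --rbdname=fio_test --rw=randread --bs=4k --iodepth=32

    # rados bench as a rough sanity check (again, not exactly analogous to rbd I/O)
    rados bench -p rbd 60 write -b 4194304 -t 16 --no-cleanup
    rados bench -p rbd 60 seq -t 16

If the librbd numbers look fine but the kernel RBD numbers don't, that narrows things down considerably.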
Here are some technical details about our setup. Four identical OSD nodes:

- E5-1630 CPU
- 32 GB RAM
- Mellanox MT27520 56 Gbps network cards
- SATA controller: LSI Logic SAS3008
Specs look fine.
Storage nodes are connected to SuperMicro chassis: 847E1C-R1K28JBOD
Is that where the SSDs live? I'm not a fan of such heavy expander over-subscription, but if you are getting good results outside of Ceph I'm guessing it's something else.
Four monitors (one on each node). We do not use CephFS so we do not run ceph-mds.
You'll want to go down to 3 or up to 5. Even numbers of monitors don't really help you in any way (and can actually hurt). I'd suggest 3.
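For what it's worth, a minimal sketch of dropping the fourth mon (assuming its id is "node4" - substitute your own, and pull it out of ceph.conf on every host afterwards):

    stop ceph-mon id=node4     # on the node hosting it (Ubuntu 14.04 / upstart)
    ceph mon remove node4      # remove it from the monitor map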
During the tests we were monitoring all the OSD nodes and the client - we haven't seen any problems on any of the hosts. Load was low, there were no CPU waits, no abnormal system interrupts, no I/O problems on the disks - none of the systems seemed to break a sweat, and yet the results are rather disappointing. We're kind of lost; any help will be appreciated.
You didn't mention the brand/model of the SSDs. Especially for writes this is important, as Ceph journal writes are O_DSYNC. Drives with proper power-loss protection can often ignore ATA_CMD_FLUSH and complete these very quickly, while other drives may need to flush all the way to the flash cells. Also, keep in mind for writes that if you have journals on the SSDs and 3x replication, you'll be doing 6 writes for every client write (3 replicas, each hitting the journal and then the data partition).
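A quick and dirty way to see how a given SSD behaves with O_DSYNC writes (the pattern the journal uses) is something like the following - /mnt/ssd/journal-test is just a placeholder path on the SSD's filesystem:

    dd if=/dev/zero of=/mnt/ssd/journal-test bs=4k count=10000 oflag=direct,dsync

Drives that are good journal candidates sustain this at high throughput, while many consumer drives collapse to a few MB/s.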
For reads and read IOPs on SSDs, you might try disabling in-memory logging and ceph authentication. You might be interested in some testing we did on a variety of SSDs here:
http://www.spinics.net/lists/ceph-users/msg15733.html
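In case it's useful, a hypothetical [global] snippet along those lines for a throwaway test cluster (the option names are the standard ones; which debug subsystems you zero out is a judgment call, and the cephx change needs daemons and clients restarted):

    [global]
        # disable cephx authentication (test clusters only)
        auth cluster required = none
        auth service required = none
        auth client required = none

        # the value after the slash is the in-memory log level
        debug ms        = 0/0
        debug osd       = 0/0
        debug filestore = 0/0
        debug journal   = 0/0
        debug auth      = 0/0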
Cheers, J