Re: Poor performance with three nodes

I agree with Greg that this isn't a great test. You'll need multiple clients to really push the Ceph cluster, and if you're using dd you have to use oflag=direct so you're measuring the cluster rather than the client's page cache.
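Something like this, for instance (re-using the test1 image from the test below; the bs and count values are just illustrative):

    dd if=/dev/zero of=/dev/rbd/data/test1 bs=4M count=2000 oflag=direct

With oflag=direct each write goes straight to the rbd device, so the number dd reports reflects the cluster rather than a memory copy into cache.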

The OSDs should be individual drives, not part of a RAID set; otherwise you're just creating extra work, unless you've reduced the number of copies to 1 in your ceph config.
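If you do want to lean on the RAID for redundancy and keep only one copy in Ceph, that's a single setting (just a sketch; note that a size-1 pool means Ceph itself provides no redundancy):

    [global]
        osd pool default size = 1

or, for an existing pool, e.g. the "data" pool used in the test below:

    ceph osd pool set data size 1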

What I've seen is that a single-threaded Ceph client maxes out around 50 MB/s for us, but the aggregate throughput of the cluster is much, much higher.
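If you want a feel for that aggregate number without standing up several client machines, rados bench with a bunch of concurrent ops is a reasonable stand-in (a sketch; "data" is the pool from the rbd path below, and 30 seconds / 16 threads are arbitrary choices):

    rados -p data bench 30 write -t 16

It won't tell you what a single VM will see, but it does show what the cluster as a whole can sustain.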

Warren


On Wed, Oct 2, 2013 at 5:24 PM, Gregory Farnum <greg@xxxxxxxxxxx> wrote:
On Wed, Oct 2, 2013 at 1:59 PM, Eric Lee Green <eric.lee.green@xxxxxxxxx> wrote:
> I have three storage servers that provide NFS and iSCSI services to my
> network, which serve data to four virtual machine compute hosts (two ESXi,
> two libvirt/kvm) with several dozen virtual machines. I decided to test out
> a Ceph deployment to see whether it could replace iSCSI as the primary way
> to provide block stores to my virtual machines, since this would allow
> better redundancy and better distribution of the load across the storage
> servers.
>
> I used ceph version 0.67.3 from RPMs. Because these are live servers
> providing NFS and iSCSI data they aren't a clean slate, so the Ceph
> datastores were created on XFS partitions. Each partition is on a single
> diskgroup (12-disk RAID6), of which there are two on each server, each
> connected to its own 3Gbit/sec SAS channel. The servers are all connected
> together with 10 gigabit Ethernet. The redundancy factor was set to 3 (three
> copies of each chunk of data) so that a chunk would be guaranteed to reside
> on at least two servers (since each server has two chunkstores).
>
> My experience with doing streaming writes via NFS or iSCSI to these servers
> is that the limiting factor is the performance of the SAS bus. That is, on
> the client side I top out at 240 megabytes per second on writes to a single
> disk group, a bit higher on reads, due to the 3 gigabit/sec SAS bus. When I
> am exercising both disk groups at once I am maxing out both SAS buses for
> double the performance. The 10 gigabit Ethernet w/9000 MTU apparently has
> plenty of bandwidth to saturate two 3 gigabit SAS buses.
>
> My first test of ceph was to create a 'test1' volume that was around 8
> gigabytes in size (or roughly the size of the root partition of one of my
> virtual machines), then test streaming reads and writes. The test for
> streaming reads and writes was simple:
>
> [root@stack1 ~]# dd if=/dev/zero of=/dev/rbd/data/test1 bs=524288
> dd: error writing ‘/dev/rbd/data/test1’: No space left on device
> 16193+0 records in
> 16192+0 records out
> 8489271296 bytes (8.5 GB) copied, 172.71 s, 49.2 MB/s
>
> [root@stack1 ~]# dd if=/dev/rbd/data/test1 of=/dev/null bs=524288
> 16192+0 records in
> 16192+0 records out
> 8489271296 bytes (8.5 GB) copied, 25.2494 s, 336 MB/s
>
> So:
>
> 1) Writes are truly appalling. They are not going at the speed of even a
> single disk drive (my disk drives are capable of streaming approximately 120
> megabytes per second).
>
> 2) Reads are more acceptable. I am getting better throughput than with a
> single SAS channel, as you would expect with reads striped across three SAS
> channels. Still, reads are slower than I expected given the speed of my
> infrastructure.
>
> Compared to Amazon EBS, reads appear roughly the same as EBS on an
> IO-enhanced instance, and writes are *much* slower.
>
> What this seems to indicate is either a) inherent Ceph performance issues
> for writes, or b) I have something misconfigured. There's simply too much of
> a mismatch between what the underlying hardware does with NFS and iSCSI, and
> what it does with Ceph, to consider this to be appropriate performance. My
> guess is (b), that I have something misconfigured. Any ideas what I should
> look for?

There are a couple of things here:
1) You aren't accounting for Ceph's journaling. Unlike a system such
as NFS, Ceph provides *very* strong data integrity guarantees under
failure conditions, and in order to do so it does full data
journaling. So, yes, cut your total disk bandwidth in half. (There's
also a lot of syncing which it manages carefully to reduce the cost,
but if you had other writes happening via your NFS/iSCSI setups that
might have been hit by the OSD running a sync on its disk, that could
be dramatically impacting the perceived throughput.)
2) Placing an OSD (with its journal) on a RAID-6 is about the worst
thing you can do for Ceph's performance; it does a lot of small
flushed-to-disk IOs in the journal in between the full data writes.
Try some other configuration? (There's a sketch of one after this list.)
3) Did you explicitly set your PG counts at any point? They default to
8, which is entirely too low; given your setup you should have
400-1000 per pool. (There are example commands after this list.)
4) There could have been something else wrong or going on with the
system, though I doubt it. But if you can provide the output of
"ceph -s", that'll let us check the basics.
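On points 2 and 3, two concrete things to try. For the journal, the
usual advice is to get it off the RAID-6 and onto something that
handles small synchronous writes well (a separate disk or SSD
partition). A sketch of the ceph.conf side, with an illustrative
device path:

    [osd]
        osd journal = /dev/disk/by-partlabel/ceph-journal-$id

(The existing journal has to be flushed and a new one created when you
move it, so don't change this on a live OSD casually.)

For the PGs, checking and raising the count is quick; 512 is just an
illustrative value in the 400-1000 range, and "data" is the pool from
the rbd path above. Raise pgp_num as well so the new PGs are actually
used for placement:

    ceph osd pool get data pg_num
    ceph osd pool set data pg_num 512
    ceph osd pool set data pgp_num 512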

Separately, if all you want is to ensure that data resides on at least
two servers, there are better ways than saying "each server has two
daemons, so I'll do 3-copy". See, e.g.,
http://ceph.com/docs/master/rados/operations/crush-map/#crush-map-rules,
and set up the rules your cluster uses so that hosts are the minimal
unit of separation. :)
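A minimal sketch of such a rule in the decompiled CRUSH map syntax (the
rule name and ruleset number are illustrative; the key line is the
chooseleaf step on "host"):

    rule per-host {
            ruleset 1
            type replicated
            min_size 1
            max_size 10
            step take default
            step chooseleaf firstn 0 type host
            step emit
    }

Point the pool at it with something like "ceph osd pool set data
crush_ruleset 1", and then size 2 already guarantees the copies land on
different hosts.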
You can also run through Mark's last series of blog posts[1] to get
some idea of the performance you can get out of different setups.
-Greg
Software Engineer #42 @ http://inktank.com | http://ceph.com

[1]: http://ceph.com/performance-2/ceph-cuttlefish-vs-bobtail-part-5-results-summary-conclusion/
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

