On 10/02/2013 05:16 PM, Eric Lee Green wrote:
On 10/2/2013 2:24 PM, Gregory Farnum wrote:
There are a couple of things here:
1) You aren't accounting for Ceph's journaling. Unlike a system such
as NFS, Ceph provides *very* strong data integrity guarantees under
failure conditions, and in order to do so it does full data
journaling. So, yes, cut your total disk bandwidth in half. (There is
also a lot of syncing, which Ceph manages carefully to reduce the cost;
but if other writes coming in via your NFS/iSCSI setups were hit by the
OSD running a sync on its disk, that could dramatically impact the
perceived throughput.)
I was running iostat on the storage servers to see what was happening
during the quick test, and was not seeing large amounts of I/O taking
place, whether created by Ceph or not. From what you are saying, I should
have seen roughly 120 megabytes per second being written on at least one
of the servers as the journal hit the server. I was seeing roughly 30
megabytes per second on each server.
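(For reference, per-disk throughput of that sort can be watched with the
extended device statistics from sysstat, something along the lines of:

    # extended per-device stats, throughput in MB/s, refreshed every 5 seconds
    iostat -xm 5

so the 30 MB/s figure is raw writes hitting the disks, not a Ceph-level
counter.)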
2) Placing an OSD (with its journal) on a RAID-6 is about the worst
thing you can do for Ceph's performance; it does a lot of small
flushed-to-disk IOs in the journal in between the full data writes.
Try some other configuration?
I don't have any other configuration. These servers are in production.
They cannot be taken down. I don't have any more servers with 10 gigabit
Ethernet cards (which are *not* cheap). I'm starting to suspect that
Ceph simply is not usable in my environment, which is a mixed-mode
shared environment rather than something that can be dedicated to any
single storage protocol. Too bad.
FWIW, we did some testing in different RAID modes a while back. These
are very old results, but can give you an idea of how much difference
there can be between different setups (in this case, with journals on
the disks):
http://ceph.com/community/ceph-performance-part-2-write-throughput-without-ssd-journals/
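For what it's worth, if a small SSD ever does free up in one of those
boxes, the journals can be moved off the RAID-6 without rebuilding
anything else. On the filestore OSDs of that era this is just a ceph.conf
setting, something like the following (the device path and size here are
only placeholders):

    [osd.0]
        osd journal = /dev/sdg1        ; example: partition on a spare SSD
        osd journal size = 10240       ; journal size in MB

The usual sequence for an existing OSD is roughly: stop the daemon, run
"ceph-osd -i 0 --flush-journal", update the config, run
"ceph-osd -i 0 --mkjournal", then start it again.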
3) Did you explicitly set your PG counts at any point? They default to
8, which is entirely too low; given your setup you should have
400-1000 per pool.
They defaulted to 64, and according to the calculation in the
documentation should be (5 OSDs x 100) / 3 replicas ≈ 166 per pool for my
configuration. Still, that does not appear to be the issue here
considering that I've created one block device and that's the only Ceph
traffic. I just raised it to 166 for the data pool; no difference.
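(In case it helps anyone else reading along, raising the count amounts to
commands along these lines:

    ceph osd pool set data pg_num 166
    ceph osd pool set data pgp_num 166

Note that pgp_num has to follow pg_num, otherwise the new placement
groups are created but the data is never actually redistributed across
them.)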
4) There could have been something wrong/going on with the system;
though I doubt it. But if you can provide the output of "ceph -s"
that'll let us check the basics.
Everything looks healthy.
[root@stack1 ~]# ceph -s
cluster 26206dba-e976-4217-a3d4-c9ea02c188be
health HEALTH_OK
monmap e2: 3 mons at
{stack1=10.200.0.3:6789/0,storage1=10.200.0.1:6789/0,storage2=10.200.0.2:6789/0},
election epoch 112, quorum 0,1,2 stack1,storage1,storage2
osdmap e793: 5 osds: 5 up, 5 in
pgmap v3504: 295 pgs: 295 active+clean; 21264 MB data, 30154 MB
used, 10205 GB / 10235 GB avail
mdsmap e25: 1/1/1 up {0=stack1=up:active}
Separately, if all you want is to ensure that data resides on at least
two servers, there are better ways than saying "each server has two
daemons, so I'll do 3-copy". See e.g.
I was as concerned about performance as I was about redundancy when I
set it to three copies. I saw the crush map rules but for my purposes
simply having three replicas was sufficient.
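For completeness, the "better way" alluded to above is presumably a CRUSH
rule that forces each replica onto a different host, so even a two-copy
pool is guaranteed to span two servers. In the old plain-text crushmap
syntax that is a sketch along these lines (rule name and ruleset number
are made up):

    rule replicas-across-hosts {
            ruleset 1
            type replicated
            min_size 1
            max_size 10
            step take default
            step chooseleaf firstn 0 type host
            step emit
    }

With a rule like that assigned to the pool, size 2 already puts the
copies on different servers, independent of how many OSD daemons each
server runs.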
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com