Re: Poor IOPS performance with Ceph

For the record

--direct=1 (or any O_DIRECT I/O anywhere) is by itself not guaranteed to be unbuffered and synchronous.
You need to add
--direct=1 --sync=1 --fsync=1 to make sure you are actually flushing the data somewhere. (This puts additional ops in the queue, though.)
In the case of RBD this is important because an O_DIRECT write by itself could actually end up in the rbd cache.
I am not sure how this behaves with different kernels; I believe the behaviour has changed several times, as applications have different assumptions about the durability of O_DIRECT writes.
I can probably dig up some reference to that if you want...
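
For example, a fio invocation along these lines (a minimal sketch only; the file name, size and queue depth are placeholders, not taken from this thread) keeps every write unbuffered and flushed:

fio --name=synctest --filename=/mnt/rbd/testfile --size=1G --rw=randwrite --bs=4k \
    --ioengine=libaio --iodepth=16 --direct=1 --sync=1 --fsync=1

With --fsync=1 fio issues an fsync after every write, so the measured IOPS reflect data that actually reached stable storage rather than a cache.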

Jan

> On 09 Sep 2015, at 11:06, Nick Fisk <nick@xxxxxxxxxx> wrote:
> 
> It looks like you are using the kernel RBD client, i.e. you ran "rbd map ...". In that case the librbd settings in ceph.conf won't have any effect, as they only apply when you use fio with the librbd engine.
> 
> There are several things you may have to do to improve kernel client performance, but the first thing is to pass the "direct=1" flag to your fio job to get a realistic idea of your cluster's performance. Be warned, though: if you thought performance was bad now, you will likely be shocked once you enable it.
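> 
> For reference, a minimal fio job file using the librbd engine might look something like the sketch below (the pool and image names are placeholders, not from this thread); only with this engine do the librbd options in ceph.conf come into play:
> 
> [rbd-test]
> ioengine=rbd
> clientname=admin
> pool=rbd
> rbdname=test-image
> rw=randread
> bs=4k
> iodepth=16
> direct=1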
> 
>> -----Original Message-----
>> From: ceph-users [mailto:ceph-users-bounces@xxxxxxxxxxxxxx] On Behalf Of
>> Daleep Bais
>> Sent: 09 September 2015 09:37
>> To: Nick Fisk <nick@xxxxxxxxxx>
>> Cc: Ceph-User <ceph-users@xxxxxxxx>
>> Subject: Re:  Poor IOPS performance with Ceph
>> 
>> Hi Nick,
>> 
>> I don't have separate SSDs/HDDs for the journal. I am using a 10 GB partition on the
>> same HDD for journaling. They are rotating HDDs, not SSDs.
>> 
>> I am using below command to run the test:
>> 
>> fio --name=test --filename=test --bs=4k --size=4G --readwrite=<read|write>
>> 
>> I did some kernel tuning and that has improved my write IOPS. For reads I am
>> using rbd_readahead and have also tuned the read_ahead_kb kernel
>> parameter.
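>> 
>> (For reference, a sketch of how those two knobs are typically set; the device name and values below are only examples, not taken from this thread:
>> 
>> echo 4096 > /sys/block/rbd0/queue/read_ahead_kb    # readahead for a kernel-mapped device
>> 
>> and in the [client] section of ceph.conf, e.g. "rbd readahead max bytes = 4194304". Note that the rbd readahead options only apply to librbd, not to a kernel-mapped device.)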
>> 
>> I should also mention that it's not x86; it's armv7 32-bit.
>> 
>> Thanks.
>> 
>> Daleep Singh Bais
>> 
>> 
>> 
>> On Wed, Sep 9, 2015 at 1:55 PM, Nick Fisk <nick@xxxxxxxxxx> wrote:
>>> -----Original Message-----
>>> From: ceph-users [mailto:ceph-users-bounces@xxxxxxxxxxxxxx] On Behalf Of
>>> Daleep Bais
>>> Sent: 09 September 2015 09:18
>>> To: Ceph-User <ceph-users@xxxxxxxx>
>>> Subject:  Poor IOPS performance with Ceph
>>> 
>>> Hi,
>>> 
>>> I have made a test Ceph cluster of 6 OSDs and 3 MONs. I am testing the
>>> read/write performance of the test cluster and the read IOPS is poor.
>>> When I test each HDD individually, I get good performance, whereas
>>> when I test the Ceph cluster, it is poor.
>> 
>> Can you give any further details about your cluster? Are your HDDs backed by
>> SSD journals?
>> 
>>> 
>>> Between nodes, using iperf, I get good bandwidth.
>>> 
>>> My cluster info :
>>> 
>>> root@ceph-node3:~# ceph --version
>>> ceph version 9.0.2-752-g64d37b7
>>> (64d37b70a687eb63edf69a91196bb124651da210)
>>> root@ceph-node3:~# ceph -s
>>>    cluster 9654468b-5c78-44b9-9711-4a7c4455c480
>>>     health HEALTH_OK
>>>     monmap e9: 3 mons at {ceph-node10=192.168.1.210:6789/0,ceph-
>>> node17=192.168.1.217:6789/0,ceph-node3=192.168.1.203:6789/0}
>>>            election epoch 442, quorum 0,1,2 ceph-node3,ceph-node10,ceph-
>>> node17
>>>     osdmap e1850: 6 osds: 6 up, 6 in
>>>      pgmap v17400: 256 pgs, 2 pools, 9274 MB data, 2330 objects
>>>            9624 MB used, 5384 GB / 5394 GB avail
>>>                 256 active+clean
>>> 
>>> 
>>> I have mapped an RBD block device to a client machine (Ubuntu 14) and from
>>> there, when I run tests using fio, I get good write IOPS; however, read is
>>> comparatively poor.
>>> 
>>> Write IOPS : 44618 approx
>>> 
>>> Read IOPS : 7356 approx
>> 
>> The first thing that strikes me is that your numbers are too good, unless these are
>> actually SSDs and not spinning HDDs. I would expect a maximum of around
>> 600 read IOPS from 6x 7.2k disks (roughly 100 IOPS per spindle), so I guess you are
>> either hitting the page cache on the OSD node(s) or the librbd cache.
>> 
>> The writes are even higher; are you using the "direct=1" option in the fio
>> job?
>> 
>>> 
>>> Pool replica - single
>>> pool 1 'test1' replicated size 1 min_size 1
>>> 
>>> I have also set rbd_readahead in my ceph.conf file.
>>> Any suggestions in this regard will help me.
>>> 
>>> Thanks.
>>> 
>>> Daleep Singh Bais
>> 
>> 
>> 

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


