On 09/17/2013 03:30 PM, Somnath Roy wrote:
Hi,
I am running Ceph on a 3-node cluster and each of my server nodes is running 10 OSDs, one per disk. I have one admin node, and all the nodes are connected with 2 x 10G networks: one is used as the cluster network and the other is configured as the public network.
Here is the status of my cluster.
~/fio_test# ceph -s
cluster b2e0b4db-6342-490e-9c28-0aadf0188023
health HEALTH_WARN clock skew detected on mon. <server-name-2>, mon. <server-name-3>
monmap e1: 3 mons at {<server-name-1>=xxx.xxx.xxx.xxx:6789/0, <server-name-2>=xxx.xxx.xxx.xxx:6789/0, <server-name-3>=xxx.xxx.xxx.xxx:6789/0}, election epoch 64, quorum 0,1,2 <server-name-1>,<server-name-2>,<server-name-3>
osdmap e391: 30 osds: 30 up, 30 in
pgmap v5202: 30912 pgs: 30912 active+clean; 8494 MB data, 27912 MB used, 11145 GB / 11172 GB avail
mdsmap e1: 0/0/1 up
I started with the rados bench command to benchmark the read performance of this cluster on a large pool (~10K PGs) and found that each rados client instance has a limit: a single client can only drive throughput up to a certain point. Each server node shows around 85-90% CPU idle, and the admin node (where the rados client is running) is around 80-85% idle. I am testing with a 4K object size.
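For reference, my invocation looks roughly like this (pool name, run time, and concurrency below are just placeholders for illustration):

# populate the pool with 4K objects first, then read them back randomly
rados bench -p testpool 60 write -b 4096 -t 16 --no-cleanup
rados bench -p testpool 60 rand -t 16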
Note that rados bench with 4k objects is different from rbd with
4k-sized I/Os - rados bench sends each request to a new object,
while rbd objects are 4M by default.
Now, I started running more clients on the admin node, and the performance scales until it hits the client CPU limit. The servers still have 30-35% CPU idle. With small object sizes, I must say that Ceph's per-OSD CPU utilization is not promising!
After this, I started testing the rados block interface with kernel rbd module from my admin node.
I have created 8 images mapped on the pool with around 10K PGs, and I am not able to scale up the performance by running fio (either by creating a software RAID across the devices or by running on individual /dev/rbd* devices). For example, when running two fio instances at once (one on /dev/rbd1 and the other on /dev/rbd2), the performance I get from each is half of what I get when running one instance alone. Here is my fio job script.
[random-reads]
ioengine=libaio
iodepth=32
filename=/dev/rbd1
rw=randread
bs=4k
direct=1
size=2G
numjobs=64
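To compare one instance against two, I simply run two copies of this job in parallel, identical except for the filename (the job file names below are only examples):

fio rbd1-randread.fio &
fio rbd2-randread.fio &
wait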
Let me know if I am following the proper procedure or not.
But, if my understanding is correct, the kernel rbd module acts as a single client to the cluster, and on one admin node I can run only one such kernel client instance.
If so, I am then limited to the client bottleneck I described earlier. The CPU on the server side is around 85-90% idle, so it is clear that the client is not driving the cluster hard enough.
My question is: is there any way to hit the cluster with more clients from a single box while testing the rbd module?
You can run multiple librbd instances easily (for example with
multiple runs of the rbd bench-write command).
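For example, something along these lines starts two independent librbd clients (image names are placeholders, and the exact option names may vary by rbd version):

rbd bench-write image1 --io-size 4096 --io-threads 16 &
rbd bench-write image2 --io-size 4096 --io-threads 16 &
wait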
The kernel rbd driver uses the same rados client instance for multiple
block devices by default. There's an option (noshare) to use a new
rados client instance for a newly mapped device, but it's not exposed
by the rbd cli. You need to use the sysfs interface that 'rbd map' uses
instead.
Once you've used rbd map once on a machine, the kernel will already
have the auth key stored, and you can use:
echo '1.2.3.4:6789 name=admin,key=client.admin,noshare poolname
imagename' > /sys/bus/rbd/add
Where 1.2.3.4:6789 is the address of a monitor, and you're connecting
as client.admin.
You can use 'rbd unmap' as usual.
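Putting that together, a rough end-to-end sequence might look like the following (the monitor address, pool, and image names are placeholders):

# initial map via the CLI stores the client.admin key in the kernel keyring
rbd map poolname/imagename --id admin
# map a second image with its own rados client instance via sysfs (noshare)
echo '1.2.3.4:6789 name=admin,key=client.admin,noshare poolname imagename2' > /sys/bus/rbd/add
# unmap devices as usual when done
rbd unmap /dev/rbd0
rbd unmap /dev/rbd1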
Josh