Re: Scaling RBD module

Somnath Roy <Somnath.Roy@xxxxxxxxxxx> · Tue, 24 Sep 2013 22:23:41 +0000

Hi Sage,
We ran quite a few experiments to see how far Ceph read performance can scale. Here is the summary.

1.
First we tried to see how far a single-node cluster with one OSD can scale. We started with the Cuttlefish release, with the entire OSD file system on SSD. With 4K objects and a single rados client on a dedicated 10G network, throughput does not go beyond a certain point.
We dug through the code and found that SimpleMessenger opens a single socket connection (per client) to talk to the OSD. We also saw that there is only one dispatcher Q (Dispatch thread) per SimpleMessenger to carry these requests to the OSD. We started adding more dispatcher threads to the Dispatch Q and rearranged several locks in Pipe.cc to identify the bottleneck. What we ended up discovering is that there are bottlenecks both upstream and downstream at the OSD level, and that changing the locking scheme in the I/O path would affect a lot of other code (some of which we don't even know about).
So we stopped that activity and worked around the upstream bottleneck by introducing more clients to the single OSD. With more clients the single OSD does scale, but at a high CPU cost: producing ~40K IOPS (4K) takes almost 12 CPU cores.
Another point: I didn't see this single OSD scale with multiple clients on the Dumpling release! Something changed.
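
To put the figures above in perspective, here is a rough back-of-the-envelope sketch. It assumes "4K" means 4096 bytes, rounds "almost 12" cores up to 12, and ignores all protocol overhead, so treat the results as approximate only.

```python
# Figures quoted in the paragraph above (rounded; assumptions noted inline).
IOPS = 40_000                    # ~40K read IOPS from one OSD
OBJECT_BYTES = 4 * 1024          # "4K" objects, assumed to be 4096 bytes
CORES = 12                       # "almost 12" cores, rounded up
LINK_BITS_PER_S = 10 * 10**9     # dedicated 10G network


def per_core_iops(iops, cores):
    """Average IOPS delivered per CPU core."""
    return iops / cores


def payload_bits_per_s(iops, object_bytes):
    """Object payload moved per second, in bits (no protocol overhead)."""
    return iops * object_bytes * 8


def link_fraction(iops, object_bytes, link_bits_per_s):
    """Fraction of the link consumed by object payload alone."""
    return payload_bits_per_s(iops, object_bytes) / link_bits_per_s


# About 3,333 IOPS per core.
iops_per_core = per_core_iops(IOPS, CORES)
# About 1.31 Gbit/s of payload, roughly 13% of the 10G link.
used = link_fraction(IOPS, OBJECT_BYTES, LINK_BITS_PER_S)
```

Under these assumptions the network is far from saturated at ~40K IOPS, which is consistent with the CPU, not the link, being the limiting resource here.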

2.   After that, we set up a proper cluster with 3 high-performing nodes and 30 OSDs in total. Here too, a single rados bench client, as well as a single rbd client instance, does not scale beyond a certain limit. It cannot generate much load, as node CPU utilization remains very low. But when running multiple client instances, performance scales until it hits the CPU limit.
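
For anyone wanting to reproduce the multi-client runs, a minimal sketch of launching several independent rados bench processes from one box could look like the following. The helper names and the pool name are hypothetical, and it assumes the `rados` CLI is on the PATH and the pool already holds objects left by an earlier write benchmark (a `seq` read needs them).

```python
import subprocess


def rados_bench_cmd(pool, seconds, mode="seq", concurrent_ops=16):
    """Build the argv for one `rados bench` client.

    mode is one of "write", "seq" or "rand"; -t sets the number of
    concurrent operations issued by that single client.
    """
    return ["rados", "-p", pool, "bench", str(seconds), mode,
            "-t", str(concurrent_ops)]


def run_clients(num_clients, pool, seconds):
    """Start num_clients separate rados bench processes and wait for all.

    Each process is its own rados client instance, so each one gets its
    own connections to the OSDs. Returns the list of exit codes.
    """
    procs = [subprocess.Popen(rados_bench_cmd(pool, seconds))
             for _ in range(num_clients)]
    return [p.wait() for p in procs]


# Example (hypothetical pool name), not executed here:
#   run_clients(8, "testpool", 60)
```

Summing the per-process results by hand then gives the aggregate figure to compare against the single-client run.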

So it is pretty clear that we cannot saturate anything with a single client, and that is why the 'noshare' option was very helpful for benchmarking rbd performance. I have callgrind data at the single OSD/single client level. Attachments do not seem to go through on the community list, which is why I can't send it to you.

Now I am benchmarking radosgw, and I think I am stuck on a similar bottleneck here. Could you please confirm whether radosgw also opens a single client instance to the cluster?

If so, is there an option similar to 'noshare' in this case? Here too, performance scales when I create multiple radosgw instances on separate nodes.
BTW, is there a way to run multiple radosgw instances on a single node, or does it have to be one per node?

Thanks & Regards
Somnath

-----Original Message-----
From: ceph-devel-owner@xxxxxxxxxxxxxxx [mailto:ceph-devel-owner@xxxxxxxxxxxxxxx] On Behalf Of Sage Weil
Sent: Tuesday, September 24, 2013 2:16 PM
To: Travis Rhoden
Cc: Josh Durgin; ceph-devel@xxxxxxxxxxxxxxx; Anirban Ray; ceph-users@xxxxxxxxxxxxxx
Subject: Re: Scaling RBD module

On Tue, 24 Sep 2013, Travis Rhoden wrote:
> This "noshare" option may have just helped me a ton -- I sure wish I
> would have asked similar questions sooner, because I have seen the
> same failure to scale.  =)
>
> One question -- when using the "noshare" option (or really, even
> without it) are there any practical limits on the number of RBDs that
> can be mounted?  I have servers with ~100 RBDs on them each, and am
> wondering if I switch them all over to using "noshare" if anything is
> going to blow up, use a ton more memory, etc.  Even without noshare,
> are there any known limits to how many RBDs can be mapped?

With noshare each mapped image will appear as a separate client instance, which means it will have its own session with the monitors and its own TCP connections to the OSDs.  It may be a viable workaround for now, but in general I would not recommend it.
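
For reference, a minimal sketch of building the line that gets written to /sys/bus/rbd/add for a noshare mapping, following the format quoted further down in this thread. The helper names are hypothetical, the secret is a placeholder, and the write itself needs root plus a loaded rbd kernel module, so the function that performs it is defined but not called here.

```python
def rbd_add_line(mon_addrs, name, secret, pool, image, noshare=True):
    """Format one mapping request for the kernel rbd sysfs interface.

    Layout, as used in this thread:
      <mon>[,<mon>...] name=<name>,secret=<secret>[,noshare] <pool> <image>
    """
    options = ["name=" + name, "secret=" + secret]
    if noshare:
        # Ask the kernel for a separate rados client for this device.
        options.append("noshare")
    return "{} {} {} {}".format(",".join(mon_addrs), ",".join(options),
                                pool, image)


def map_image(line, add_path="/sys/bus/rbd/add"):
    """Write the request to sysfs. Requires root; errors show up in dmesg."""
    with open(add_path, "w") as f:
        f.write(line)


# Placeholder values only; nothing is written to sysfs here.
line = rbd_add_line(["1.2.3.4:6789"], "admin", "XXXX", "poolname", "imagename")
```

With ~100 images per host, each such line would create one more client instance, with the monitor session and OSD connections that go with it.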

I'm very curious what the scaling issue is with the shared client.  Do you have a working perf that can capture callgraph information on this machine?

sage

> 
> Thanks!
> 
>  - Travis
> 
> 
> On Thu, Sep 19, 2013 at 8:03 PM, Somnath Roy <Somnath.Roy@xxxxxxxxxxx>
> wrote:
>       Thanks Josh !
>       I am able to successfully add this noshare option in the image
>       mapping now. Looking at dmesg output, I found that was indeed
>       the secret key problem. Block performance is scaling now.
> 
>       Regards
>       Somnath
> 
>       -----Original Message-----
>       From: ceph-devel-owner@xxxxxxxxxxxxxxx
>       [mailto:ceph-devel-owner@xxxxxxxxxxxxxxx] On Behalf Of Josh
>       Durgin
>       Sent: Thursday, September 19, 2013 12:24 PM
>       To: Somnath Roy
>       Cc: Sage Weil; ceph-devel@xxxxxxxxxxxxxxx; Anirban Ray;
>       ceph-users@xxxxxxxxxxxxxx
>       Subject: Re:  Scaling RBD module
> 
>       On 09/19/2013 12:04 PM, Somnath Roy wrote:
>       > Hi Josh,
>       > Thanks for the information. I am trying to add the following
>       > but am hitting a permission issue.
>       >
>       > root@emsclient:/etc# echo '<mon-1>:6789,<mon-2>:6789,<mon-3>:6789 name=admin,key=client.admin,noshare test_rbd ceph_block_test' > /sys/bus/rbd/add
>       > -bash: echo: write error: Operation not permitted
> 
>       If you check dmesg, it will probably show an error trying to
>       authenticate to the cluster.
> 
>       Instead of key=client.admin, you can pass the base64 secret
>       value as shown in 'ceph auth list' with the
>       secret=XXXXXXXXXXXXXXXXXXXXX option.
> 
>       BTW, there's a ticket for adding the noshare option to rbd map
>       so using the sysfs interface like this is never necessary:
> 
>       http://tracker.ceph.com/issues/6264
> 
>       Josh
> 
>       > Here is the contents of rbd directory..
>       >
>       > root@emsclient:/sys/bus/rbd# ll
>       > total 0
>       > drwxr-xr-x  4 root root    0 Sep 19 11:59 ./
>       > drwxr-xr-x 30 root root    0 Sep 13 11:41 ../
>       > --w-------  1 root root 4096 Sep 19 11:59 add
>       > drwxr-xr-x  2 root root    0 Sep 19 12:03 devices/
>       > drwxr-xr-x  2 root root    0 Sep 19 12:03 drivers/
>       > -rw-r--r--  1 root root 4096 Sep 19 12:03 drivers_autoprobe
>       > --w-------  1 root root 4096 Sep 19 12:03 drivers_probe
>       > --w-------  1 root root 4096 Sep 19 12:03 remove
>       > --w-------  1 root root 4096 Sep 19 11:59 uevent
>       >
>       >
>       > I checked: even when logged in as root, I can't write
>       > anything under /sys.
>       >
>       > Here is the Ubuntu version I am using..
>       >
>       > root@emsclient:/etc# lsb_release -a
>       > No LSB modules are available.
>       > Distributor ID: Ubuntu
>       > Description:    Ubuntu 13.04
>       > Release:        13.04
>       > Codename:       raring
>       >
>       > Here is the mount information....
>       >
>       > root@emsclient:/etc# mount
>       > /dev/mapper/emsclient--vg-root on / type ext4 (rw,errors=remount-ro)
>       > proc on /proc type proc (rw,noexec,nosuid,nodev)
>       > sysfs on /sys type sysfs (rw,noexec,nosuid,nodev)
>       > none on /sys/fs/cgroup type tmpfs (rw)
>       > none on /sys/fs/fuse/connections type fusectl (rw)
>       > none on /sys/kernel/debug type debugfs (rw)
>       > none on /sys/kernel/security type securityfs (rw)
>       > udev on /dev type devtmpfs (rw,mode=0755)
>       > devpts on /dev/pts type devpts (rw,noexec,nosuid,gid=5,mode=0620)
>       > tmpfs on /run type tmpfs (rw,noexec,nosuid,size=10%,mode=0755)
>       > none on /run/lock type tmpfs (rw,noexec,nosuid,nodev,size=5242880)
>       > none on /run/shm type tmpfs (rw,nosuid,nodev)
>       > none on /run/user type tmpfs (rw,noexec,nosuid,nodev,size=104857600,mode=0755)
>       > /dev/sda1 on /boot type ext2 (rw)
>       > /dev/mapper/emsclient--vg-home on /home type ext4 (rw)
>       >
>       >
>       > Any idea what went wrong here ?
>       >
>       > Thanks & Regards
>       > Somnath
>       >
>       > -----Original Message-----
>       > From: Josh Durgin [mailto:josh.durgin@xxxxxxxxxxx]
>       > Sent: Wednesday, September 18, 2013 6:10 PM
>       > To: Somnath Roy
>       > Cc: Sage Weil; ceph-devel@xxxxxxxxxxxxxxx; Anirban Ray;
>       > ceph-users@xxxxxxxxxxxxxx
>       > Subject: Re:  Scaling RBD module
>       >
>       > On 09/17/2013 03:30 PM, Somnath Roy wrote:
>       >> Hi,
>       >> I am running Ceph on a 3-node cluster, and each server node
>       >> runs 10 OSDs, one per disk. I have one admin node, and all the
>       >> nodes are connected with 2 x 10G networks: one is the cluster
>       >> network and the other is configured as the public network.
>       >>
>       >> Here is the status of my cluster.
>       >>
>       >> ~/fio_test# ceph -s
>       >>
>       >>     cluster b2e0b4db-6342-490e-9c28-0aadf0188023
>       >>      health HEALTH_WARN clock skew detected on mon. <server-name-2>, mon. <server-name-3>
>       >>      monmap e1: 3 mons at {<server-name-1>=xxx.xxx.xxx.xxx:6789/0,<server-name-2>=xxx.xxx.xxx.xxx:6789/0,<server-name-3>=xxx.xxx.xxx.xxx:6789/0}, election epoch 64, quorum 0,1,2 <server-name-1>,<server-name-2>,<server-name-3>
>       >>      osdmap e391: 30 osds: 30 up, 30 in
>       >>       pgmap v5202: 30912 pgs: 30912 active+clean; 8494 MB data, 27912 MB used, 11145 GB / 11172 GB avail
>       >>      mdsmap e1: 0/0/1 up
>       >>
>       >>
>       >> I started with the rados bench command to benchmark the read
>       >> performance of this cluster on a large pool (~10K PGs) and found
>       >> that each rados client has a limit: each client can only drive
>       >> load up to a certain point. Each server node's CPU is around
>       >> 85-90% idle, and the admin node (where the rados client runs)
>       >> is around 80-85% idle. I am testing with a 4K object size.
>       >
>       > Note that rados bench with 4k objects is different from rbd
>       > with 4k-sized I/Os - rados bench sends each request to a new
>       > object, while rbd objects are 4M by default.
>       >
>       >> Next, I started running more clients on the admin node, and
>       >> performance scales until it hits the client CPU limit. The servers
>       >> are still 30-35% idle. With small object sizes I must say that
>       >> Ceph's per-OSD CPU utilization is not promising!
>       >>
>       >> After this, I started testing the rados block interface with
>       >> the kernel rbd module from my admin node.
>       >> I have created 8 images mapped on the pool with around 10K
>       >> PGs, and I am not able to scale up performance by running fio
>       >> (either on a software RAID or on individual /dev/rbd*
>       >> instances). For example, when running multiple fio instances
>       >> (one on /dev/rbd1 and the other on /dev/rbd2), the performance
>       >> I am getting is half of what I get when running one instance.
>       >> Here is my fio job script.
>       >>
>       >> [random-reads]
>       >> ioengine=libaio
>       >> iodepth=32
>       >> filename=/dev/rbd1
>       >> rw=randread
>       >> bs=4k
>       >> direct=1
>       >> size=2G
>       >> numjobs=64
>       >>
>       >> Let me know if I am following the proper procedure or not.
>       >>
>       >> But if my understanding is correct, the kernel rbd module
>       >> acts as a client to the cluster, and on one admin node I can
>       >> run only one such kernel instance.
>       >> If so, I am limited by the client bottleneck that I described
>       >> earlier. The servers' CPUs are around 85-90% idle, so it is
>       >> clear that the client is not driving enough load.
>       >>
>       >> My question is: is there any way to hit the cluster with more
>       >> clients from a single box while testing the rbd module?
>       >
>       > You can run multiple librbd instances easily (for example with
>       > multiple runs of the rbd bench-write command).
>       >
>       > The kernel rbd driver uses the same rados client instance for
>       > multiple block devices by default. There's an option (noshare)
>       > to use a new rados client instance for a newly mapped device,
>       > but it's not exposed by the rbd cli. You need to use the sysfs
>       > interface that 'rbd map' uses instead.
>       >
>       > Once you've used rbd map once on a machine, the kernel will
>       > already have the auth key stored, and you can use:
>       >
>       > echo '1.2.3.4:6789 name=admin,key=client.admin,noshare poolname imagename' > /sys/bus/rbd/add
>       >
>       > Where 1.2.3.4:6789 is the address of a monitor, and you're
>       > connecting as client.admin.
>       >
>       > You can use 'rbd unmap' as usual.
>       >
>       > Josh
>       >
>       >
>       > ________________________________
>       >
>       > PLEASE NOTE: The information contained in this electronic mail
>       > message is intended only for the use of the designated
>       > recipient(s) named above. If the reader of this message is not
>       > the intended recipient, you are hereby notified that you have
>       > received this message in error and that any review,
>       > dissemination, distribution, or copying of this message is
>       > strictly prohibited. If you have received this communication in
>       > error, please notify the sender by telephone or e-mail (as shown
>       > above) immediately and destroy any and all copies of this
>       > message in your possession (whether hard copies or
>       > electronically stored copies).
>       >
>       >
> 
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel"
> in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo
> info at  http://vger.kernel.org/majordomo-info.html
>
>
> _______________________________________________
> ceph-users mailing list
> ceph-users@xxxxxxxxxxxxxx
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
