Re: Scaling RBD module

Travis Rhoden <trhoden@xxxxxxxxx> · Tue, 24 Sep 2013 17:24:57 -0400

On Tue, Sep 24, 2013 at 5:16 PM, Sage Weil <sage@xxxxxxxxxxx> wrote:
> On Tue, 24 Sep 2013, Travis Rhoden wrote:
>> This "noshare" option may have just helped me a ton -- I sure wish I would
>> have asked similar questions sooner, because I have seen the same failure to
>> scale.  =)
>>
>> One question -- when using the "noshare" option (or really, even without it)
>> are there any practical limits on the number of RBDs that can be mounted?  I
>> have servers with ~100 RBDs on them each, and am wondering if I switch them
>> all over to using "noshare" if anything is going to blow up, use a ton more
>> memory, etc.  Even without noshare, are there any known limits to how many
>> RBDs can be mapped?
>
> With noshare each mapped image will appear as a separate client instance,
> which means it will have its own session with the monitors and its own TCP
> connections to the OSDs.  It may be a viable workaround for now but in
> general I would not recommend it.

Good to know.  We are still playing with CephFS as our ultimate
solution, but in the meantime this may indeed be a good workaround for
me.
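
For a rough sense of the overhead, here is a back-of-the-envelope sketch.
It assumes (my assumption, not something stated above) that each client
instance keeps one monitor session and at most one TCP connection per
OSD. The 30-OSD figure is borrowed from the cluster described further
down the thread, purely for illustration.

```shell
#!/bin/sh
# Upper-bound estimate of client-side connections for kernel rbd mappings.
# Assumption: one monitor session plus at most one TCP connection per OSD
# for every client instance.
estimate_conns() {
    clients=$1   # number of client instances on this host
    osds=$2      # number of OSDs in the cluster
    echo $(( clients * (1 + osds) ))
}

# Shared client: all mapped images reuse a single client instance.
echo "shared:  $(estimate_conns 1 30)"
# noshare: one client instance per mapped image (100 images here).
echo "noshare: $(estimate_conns 100 30)"
```

This is only an upper bound; a client only needs connections to the OSDs
it actually talks to.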

>
> I'm very curious what the scaling issue is with the shared client.  Do you
> have a working perf that can capture callgraph information on this
> machine?

Not currently, but I could certainly work on it.  The issue that we
see is basically what the OP showed -- that there seems to be a finite
amount of bandwidth that I can read/write from a machine, regardless
of how many RBDs are involved.  i.e., if I can get 1GB/sec writes on
one RBD when everything else is idle, running the same test on two
RBDs in parallel *from the same machine* ends up with the sum of the
two at ~1GB/sec, split fairly evenly. However, if I do the same thing
and run the same test on two RBDs, each hosted on a separate machine,
I definitely see increased bandwidth.  Monitoring network traffic and
the Ceph OSD nodes seems to imply that they are not overloaded --
there is more bandwidth to be had, the clients just aren't able to
push the data fast enough.  That's why I'm hoping creating a "new"
client for each RBD will improve things.

I'm not going to enable this everywhere just yet. We will try it on a
few RBDs first, and perhaps enable it on some RBDs that are
particularly heavily loaded.
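
For the few test RBDs, the mapping string can be put together along
these lines. This is a minimal sketch based on the sysfs command quoted
further down, using secret= rather than key=. The function name
rbd_add_spec and the placeholder values are made up for illustration.

```shell
#!/bin/sh
# Build the string that the kernel rbd driver accepts on /sys/bus/rbd/add,
# with noshare so the image gets its own client instance.
# Args: monitor list, client name (without "client."), base64 secret,
#       pool name, image name.
rbd_add_spec() {
    printf '%s name=%s,secret=%s,noshare %s %s\n' "$1" "$2" "$3" "$4" "$5"
}

# Placeholder values; substitute a real monitor address and secret.
# A real secret can be read with: ceph auth get-key client.admin
spec=$(rbd_add_spec '1.2.3.4:6789' admin 'XXXXXXXXXXXXXXXXXXXXX' poolname imagename)
echo "$spec"

# To actually map the image (as root), uncomment:
# echo "$spec" > /sys/bus/rbd/add
```

Printing the string first makes it easy to check before anything is
written to sysfs; 'rbd unmap' works as usual afterwards.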

I'll work on the perf capture!
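
A minimal sketch of the capture, assuming perf is installed and run as
root. The helper name and the output file name are hypothetical; -a
samples all CPUs and -g records call graphs.

```shell
#!/bin/sh
# Print a perf command line that samples all CPUs with call graphs for a
# given number of seconds. It only prints the command, so the sketch has
# no side effects; run the printed line as root while the test is active.
perf_record_cmd() {
    secs=${1:-30}          # sampling window in seconds
    out=${2:-perf.data}    # output file (perf.data is perf's own default)
    echo "perf record -a -g -o $out -- sleep $secs"
}

perf_record_cmd 30 rbd-scaling.data
# Afterwards, view the call graph with:
#   perf report -g --stdio -i rbd-scaling.data
```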

Thanks for the feedback, as always.

 - Travis
>
> sage
>
>>
>> Thanks!
>>
>>  - Travis
>>
>>
>> On Thu, Sep 19, 2013 at 8:03 PM, Somnath Roy <Somnath.Roy@xxxxxxxxxxx>
>> wrote:
>>       Thanks Josh!
>>       I am now able to add the noshare option to the image mapping.
>>       Looking at the dmesg output, I found that it was indeed the
>>       secret key problem. Block performance is scaling now.
>>
>>       Regards
>>       Somnath
>>
>>       -----Original Message-----
>>       From: ceph-devel-owner@xxxxxxxxxxxxxxx
>>       [mailto:ceph-devel-owner@xxxxxxxxxxxxxxx] On Behalf Of Josh
>>       Durgin
>>       Sent: Thursday, September 19, 2013 12:24 PM
>>       To: Somnath Roy
>>       Cc: Sage Weil; ceph-devel@xxxxxxxxxxxxxxx; Anirban Ray;
>>       ceph-users@xxxxxxxxxxxxxx
>>       Subject: Re:  Scaling RBD module
>>
>>       On 09/19/2013 12:04 PM, Somnath Roy wrote:
>>       > Hi Josh,
>>       > Thanks for the information. I am trying to add the following
>>       > but am hitting a permission issue.
>>       >
>>       > root@emsclient:/etc# echo '<mon-1>:6789,<mon-2>:6789,<mon-3>:6789 name=admin,key=client.admin,noshare test_rbd ceph_block_test' > /sys/bus/rbd/add
>>       > -bash: echo: write error: Operation not permitted
>>
>>       If you check dmesg, it will probably show an error trying to
>>       authenticate to the cluster.
>>
>>       Instead of key=client.admin, you can pass the base64 secret
>>       value as shown in 'ceph auth list' with the
>>       secret=XXXXXXXXXXXXXXXXXXXXX option.
>>
>>       BTW, there's a ticket for adding the noshare option to rbd map
>>       so using the sysfs interface like this is never necessary:
>>
>>       http://tracker.ceph.com/issues/6264
>>
>>       Josh
>>
>>       > Here are the contents of the rbd directory:
>>       >
>>       > root@emsclient:/sys/bus/rbd# ll
>>       > total 0
>>       > drwxr-xr-x  4 root root    0 Sep 19 11:59 ./
>>       > drwxr-xr-x 30 root root    0 Sep 13 11:41 ../
>>       > --w-------  1 root root 4096 Sep 19 11:59 add
>>       > drwxr-xr-x  2 root root    0 Sep 19 12:03 devices/
>>       > drwxr-xr-x  2 root root    0 Sep 19 12:03 drivers/
>>       > -rw-r--r--  1 root root 4096 Sep 19 12:03 drivers_autoprobe
>>       > --w-------  1 root root 4096 Sep 19 12:03 drivers_probe
>>       > --w-------  1 root root 4096 Sep 19 12:03 remove
>>       > --w-------  1 root root 4096 Sep 19 11:59 uevent
>>       >
>>       >
>>       > I checked: even when logged in as root, I can't write anything
>>       > under /sys.
>>       >
>>       > Here is the Ubuntu version I am using..
>>       >
>>       > root@emsclient:/etc# lsb_release -a
>>       > No LSB modules are available.
>>       > Distributor ID: Ubuntu
>>       > Description:    Ubuntu 13.04
>>       > Release:        13.04
>>       > Codename:       raring
>>       >
>>       > Here is the mount information....
>>       >
>>       > root@emsclient:/etc# mount
>>       > /dev/mapper/emsclient--vg-root on / type ext4 (rw,errors=remount-ro)
>>       > proc on /proc type proc (rw,noexec,nosuid,nodev)
>>       > sysfs on /sys type sysfs (rw,noexec,nosuid,nodev)
>>       > none on /sys/fs/cgroup type tmpfs (rw)
>>       > none on /sys/fs/fuse/connections type fusectl (rw)
>>       > none on /sys/kernel/debug type debugfs (rw)
>>       > none on /sys/kernel/security type securityfs (rw)
>>       > udev on /dev type devtmpfs (rw,mode=0755)
>>       > devpts on /dev/pts type devpts (rw,noexec,nosuid,gid=5,mode=0620)
>>       > tmpfs on /run type tmpfs (rw,noexec,nosuid,size=10%,mode=0755)
>>       > none on /run/lock type tmpfs (rw,noexec,nosuid,nodev,size=5242880)
>>       > none on /run/shm type tmpfs (rw,nosuid,nodev)
>>       > none on /run/user type tmpfs (rw,noexec,nosuid,nodev,size=104857600,mode=0755)
>>       > /dev/sda1 on /boot type ext2 (rw)
>>       > /dev/mapper/emsclient--vg-home on /home type ext4 (rw)
>>       >
>>       >
>>       > Any idea what went wrong here ?
>>       >
>>       > Thanks & Regards
>>       > Somnath
>>       >
>>       > -----Original Message-----
>>       > From: Josh Durgin [mailto:josh.durgin@xxxxxxxxxxx]
>>       > Sent: Wednesday, September 18, 2013 6:10 PM
>>       > To: Somnath Roy
>>       > Cc: Sage Weil; ceph-devel@xxxxxxxxxxxxxxx; Anirban Ray;
>>       > ceph-users@xxxxxxxxxxxxxx
>>       > Subject: Re:  Scaling RBD module
>>       >
>>       > On 09/17/2013 03:30 PM, Somnath Roy wrote:
>>       >> Hi,
>>       >> I am running Ceph on a 3-node cluster, and each of my server nodes
>>       >> is running 10 OSDs, one for each disk. I have one admin node, and
>>       >> all the nodes are connected with a 2 x 10G network. One network is
>>       >> for the cluster and the other is configured as the public network.
>>       >>
>>       >> Here is the status of my cluster.
>>       >>
>>       >> ~/fio_test# ceph -s
>>       >>
>>       >>     cluster b2e0b4db-6342-490e-9c28-0aadf0188023
>>       >>      health HEALTH_WARN clock skew detected on mon. <server-name-2>, mon. <server-name-3>
>>       >>      monmap e1: 3 mons at {<server-name-1>=xxx.xxx.xxx.xxx:6789/0, <server-name-2>=xxx.xxx.xxx.xxx:6789/0, <server-name-3>=xxx.xxx.xxx.xxx:6789/0}, election epoch 64, quorum 0,1,2 <server-name-1>,<server-name-2>,<server-name-3>
>>       >>      osdmap e391: 30 osds: 30 up, 30 in
>>       >>       pgmap v5202: 30912 pgs: 30912 active+clean; 8494 MB data, 27912 MB used, 11145 GB / 11172 GB avail
>>       >>      mdsmap e1: 0/0/1 up
>>       >>
>>       >>
>>       >> I started with the rados bench command to benchmark the read
>>       >> performance of this cluster on a large pool (~10K PGs) and found
>>       >> that each rados client has a limit: each client can only drive up
>>       >> to a certain mark. CPU on each server node is around 85-90% idle,
>>       >> and on the admin node (where the rados client is running) it is
>>       >> around 80-85% idle. I am testing with a 4K object size.
>>       >
>>       > Note that rados bench with 4k objects is different from rbd
>>       > with 4k-sized I/Os - rados bench sends each request to a new
>>       > object, while rbd objects are 4M by default.
>>       >
>>       >> Next, I started running more clients on the admin node, and
>>       >> performance scales until it hits the client CPU limit. The servers
>>       >> still have 30-35% of their CPU idle. With a small object size, I
>>       >> must say that Ceph's per-OSD CPU utilization is not promising!
>>       >>
>>       >> After this, I started testing the rados block interface with the
>>       >> kernel rbd module from my admin node.
>>       >> I have created 8 images mapped on the pool with around 10K PGs,
>>       >> and I am not able to scale up performance by running fio (either
>>       >> on a software RAID or on individual /dev/rbd* instances). For
>>       >> example, when running multiple fio instances (one on /dev/rbd1 and
>>       >> the other on /dev/rbd2), the performance I am getting is half of
>>       >> what I get when running one instance. Here is my fio job script.
>>       >>
>>       >> [random-reads]
>>       >> ioengine=libaio
>>       >> iodepth=32
>>       >> filename=/dev/rbd1
>>       >> rw=randread
>>       >> bs=4k
>>       >> direct=1
>>       >> size=2G
>>       >> numjobs=64
>>       >>
>>       >> Let me know if I am following the proper procedure or not.
>>       >>
>>       >> But if my understanding is correct, the kernel rbd module acts as
>>       >> a client to the cluster, and on one admin node I can run only one
>>       >> such kernel instance.
>>       >> If so, I am limited by the client bottleneck that I described
>>       >> earlier. The servers' CPUs are around 85-90% idle, so it is clear
>>       >> that the client is not driving enough load.
>>       >>
>>       >> My question is: is there any way to hit the cluster with more
>>       >> clients from a single box while testing the rbd module?
>>       >
>>       > You can run multiple librbd instances easily (for example with
>>       > multiple runs of the rbd bench-write command).
>>       >
>>       > The kernel rbd driver uses the same rados client instance for
>>       > multiple block devices by default. There's an option (noshare)
>>       > to use a new rados client instance for a newly mapped device,
>>       > but it's not exposed by the rbd cli. You need to use the sysfs
>>       > interface that 'rbd map' uses instead.
>>       >
>>       > Once you've used rbd map once on a machine, the kernel will
>>       > already have the auth key stored, and you can use:
>>       >
>>       > echo '1.2.3.4:6789 name=admin,key=client.admin,noshare poolname imagename' > /sys/bus/rbd/add
>>       >
>>       > Where 1.2.3.4:6789 is the address of a monitor, and you're
>>       > connecting as client.admin.
>>       >
>>       > You can use 'rbd unmap' as usual.
>>       >
>>       > Josh
>>       >
>>
>> --
>> To unsubscribe from this list: send the line "unsubscribe ceph-devel"
>> in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo
>> info at  http://vger.kernel.org/majordomo-info.html
>>
>>
>> _______________________________________________
>> ceph-users mailing list
>> ceph-users@xxxxxxxxxxxxxx
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com