Hi Somnath!

On Tue, 24 Sep 2013, Somnath Roy wrote:
> Hi Sage,
> We did quite a few experiments to see how Ceph read performance can
> scale up. Here is the summary.
>
> 1. First we tried to see how far a single-node cluster with one OSD can
> scale up. We started with the Cuttlefish release, with the entire OSD
> file system on an SSD. What we saw is that with 4K-sized objects and a
> single rados client on a dedicated 10G network, throughput can't go
> beyond a certain point.

Are you using 'rados bench' to generate this load or something else? We've
noticed that individual rados bench commands do not scale beyond a point
but have never looked into it; the problem may be in the bench code and
not in librados or SimpleMessenger.

> We dug through the code and found that SimpleMessenger opens a single
> socket connection (per client) to talk to the OSD. Also, we saw there is
> only one dispatch queue (dispatch thread) per SimpleMessenger to carry
> these requests to the OSD. We started adding more dispatcher threads to
> the dispatch queue and rearranged several locks in Pipe.cc to identify
> the bottleneck. What we ended up discovering is that there are
> bottlenecks both upstream and downstream at the OSD level, and changing
> the locking scheme in the I/O path would affect a lot of other code
> (that we don't even know about).
>
> So, we stopped that activity and instead worked around the upstream
> bottleneck by pointing more clients at the single OSD. What we saw is
> that a single OSD does scale, but with a lot of CPU utilization: to
> produce ~40K IOPS (4K) it takes almost 12 cores of CPU.

Just to make sure I understand: the single OSD dispatch queue does not
become a problem with multiple clients?

Possibilities that come to mind:

- DispatchQueue is doing some funny stuff to keep individual clients'
  messages ordered but to fairly process requests from multiple clients.
  There could easily be a problem with the per-client queue portion of
  this.

- Pipe's use of MSG_MORE is making the TCP stream efficient... you might
  try setting 'ms tcp nodelay = false'.

- The message encode is happening in the thread that sends messages over
  the wire. Maybe doing it in send_message() instead of writer() will
  keep that on a separate core from the thread that's shoveling data into
  the socket.

> Another point: I didn't see this single OSD scale with multiple clients
> on the Dumpling release!! Something changed..

What is it with dumpling?

> 2. After that, we set up a proper cluster with 3 high-performing nodes
> and 30 OSDs in total. Here also, we are seeing that a single rados bench
> client, as well as a single rbd client instance, does not scale beyond a
> certain limit. It is not able to generate much load, as node CPU
> utilization remains very low. But running multiple client instances, the
> performance scales until it hits the CPU limit.
>
> So, it is pretty clear we are not able to saturate anything with a
> single client, and that's why the 'noshare' option was very helpful for
> the rbd performance benchmark. I have single-OSD/single-client callgrind
> data attached here.

Something from perf that shows a call graph would be more helpful to
identify where things are waiting. We haven't done much optimizing at
this level at all, so these results aren't entirely surprising.

> Now, I am doing the benchmark for radosgw and I think I am stuck with a
> similar bottleneck here. Could you please confirm whether radosgw also
> opens a single client instance to the cluster?

It is: each radosgw has a single librados client instance.

> If so, is there any similar option like 'noshare' in this case? Here
> also, creating multiple radosgw instances on separate nodes makes the
> performance scale.

No, but

> BTW, is there a way to run multiple radosgw instances on a single node,
> or does it have to be one per node?

yes. You just need to make sure they have different fastcgi sockets they
listen on, and probably set up a separate web server in front of each one.
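
[Editor's note: to make that multi-instance setup concrete, here is a
minimal ceph.conf sketch for two radosgw instances on one node. The
section names, host name, and socket/log paths are illustrative, not
from this thread:]

    # two gateway instances on the same host, each with its own
    # fastcgi socket and log file (names/paths are illustrative)
    [client.radosgw.gw1]
        host = gateway-node
        rgw socket path = /var/run/ceph/radosgw.gw1.sock
        log file = /var/log/ceph/radosgw.gw1.log

    [client.radosgw.gw2]
        host = gateway-node
        rgw socket path = /var/run/ceph/radosgw.gw2.sock
        log file = /var/log/ceph/radosgw.gw2.log

[Each instance would then be started under its own name, e.g.
'radosgw -n client.radosgw.gw1', with a separate web server or vhost
pointing at the matching fastcgi socket.]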
I think the next step to understanding what is going on is getting the
right profiling tools in place so we can see where the client threads are
spending their (non-idle and idle) time...

sage
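
[Editor's note: a minimal sketch of that kind of profiling, assuming
perf is available and a single rados bench process is the client under
test; the pid lookup and duration are illustrative:]

    # sample the client process, recording call graphs, for 30 seconds
    # (assumes exactly one 'rados' process is running)
    perf record -g -p $(pidof rados) -- sleep 30

    # browse where the threads spent their time, as a call tree
    perf report -g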
> Thanks & Regards
> Somnath
>
> -----Original Message-----
> From: ceph-devel-owner@xxxxxxxxxxxxxxx
> [mailto:ceph-devel-owner@xxxxxxxxxxxxxxx] On Behalf Of Sage Weil
> Sent: Tuesday, September 24, 2013 2:16 PM
> To: Travis Rhoden
> Cc: Josh Durgin; ceph-devel@xxxxxxxxxxxxxxx; Anirban Ray;
> ceph-users@xxxxxxxxxxxxxx
> Subject: Re: Scaling RBD module
>
> On Tue, 24 Sep 2013, Travis Rhoden wrote:
> > > This "noshare" option may have just helped me a ton -- I sure wish I
> > > would have asked similar questions sooner, because I have seen the
> > > same failure to scale. =)
> > >
> > > One question -- when using the "noshare" option (or really, even
> > > without it) are there any practical limits on the number of RBDs
> > > that can be mounted? I have servers with ~100 RBDs on them each, and
> > > am wondering if I switch them all over to using "noshare" if
> > > anything is going to blow up, use a ton more memory, etc. Even
> > > without noshare, are there any known limits to how many RBDs can be
> > > mapped?
> >
> > With noshare each mapped image will appear as a separate client
> > instance, which means it will have its own session with the monitors
> > and its own TCP connections to the OSDs. It may be a viable workaround
> > for now, but in general I would not recommend it.
> >
> > I'm very curious what the scaling issue is with the shared client. Do
> > you have a working perf that can capture callgraph information on this
> > machine?
> >
> > sage
> >
> > > Thanks!
> > >
> > > - Travis
> > >
> > > On Thu, Sep 19, 2013 at 8:03 PM, Somnath Roy
> > > <Somnath.Roy@xxxxxxxxxxx> wrote:
> > > Thanks Josh!
> > > I am able to successfully add this noshare option in the image
> > > mapping now. Looking at dmesg output, I found that it was indeed the
> > > secret key problem. Block performance is scaling now.
> > >
> > > Regards
> > > Somnath
> > >
> > > -----Original Message-----
> > > From: ceph-devel-owner@xxxxxxxxxxxxxxx
> > > [mailto:ceph-devel-owner@xxxxxxxxxxxxxxx] On Behalf Of Josh Durgin
> > > Sent: Thursday, September 19, 2013 12:24 PM
> > > To: Somnath Roy
> > > Cc: Sage Weil; ceph-devel@xxxxxxxxxxxxxxx; Anirban Ray;
> > > ceph-users@xxxxxxxxxxxxxx
> > > Subject: Re: Scaling RBD module
> > >
> > > On 09/19/2013 12:04 PM, Somnath Roy wrote:
> > > > Hi Josh,
> > > > Thanks for the information. I am trying to add the following but
> > > > hitting some permission issue.
> > > >
> > > > root@emsclient:/etc# echo '<mon-1>:6789,<mon-2>:6789,<mon-3>:6789
> > > > name=admin,key=client.admin,noshare test_rbd ceph_block_test' >
> > > > /sys/bus/rbd/add
> > > > -bash: echo: write error: Operation not permitted
> > >
> > > If you check dmesg, it will probably show an error trying to
> > > authenticate to the cluster.
> > >
> > > Instead of key=client.admin, you can pass the base64 secret value as
> > > shown in 'ceph auth list' with the secret=XXXXXXXXXXXXXXXXXXXXX
> > > option.
> > >
> > > BTW, there's a ticket for adding the noshare option to rbd map, so
> > > using the sysfs interface like this is never necessary:
> > >
> > > http://tracker.ceph.com/issues/6264
> > >
> > > Josh
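
[Editor's note: putting Josh's hints together, the working form of the
sysfs write would look roughly like this. The monitor address stays a
placeholder as in the thread, and the base64 key stands in for the value
shown by 'ceph auth list':]

    # <base64-key> is the secret from 'ceph auth list', not the user name
    echo '<mon-1>:6789 name=admin,secret=<base64-key>,noshare test_rbd ceph_block_test' > /sys/bus/rbd/add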
> > > > Here is the contents of the rbd directory..
> > > >
> > > > root@emsclient:/sys/bus/rbd# ll
> > > > total 0
> > > > drwxr-xr-x  4 root root    0 Sep 19 11:59 ./
> > > > drwxr-xr-x 30 root root    0 Sep 13 11:41 ../
> > > > --w-------  1 root root 4096 Sep 19 11:59 add
> > > > drwxr-xr-x  2 root root    0 Sep 19 12:03 devices/
> > > > drwxr-xr-x  2 root root    0 Sep 19 12:03 drivers/
> > > > -rw-r--r--  1 root root 4096 Sep 19 12:03 drivers_autoprobe
> > > > --w-------  1 root root 4096 Sep 19 12:03 drivers_probe
> > > > --w-------  1 root root 4096 Sep 19 12:03 remove
> > > > --w-------  1 root root 4096 Sep 19 11:59 uevent
> > > >
> > > > I checked that even if I am logged in as root, I can't write
> > > > anything on /sys.
> > > >
> > > > Here is the Ubuntu version I am using..
> > > >
> > > > root@emsclient:/etc# lsb_release -a
> > > > No LSB modules are available.
> > > > Distributor ID: Ubuntu
> > > > Description:    Ubuntu 13.04
> > > > Release:        13.04
> > > > Codename:       raring
> > > >
> > > > Here is the mount information....
> > > >
> > > > root@emsclient:/etc# mount
> > > > /dev/mapper/emsclient--vg-root on / type ext4 (rw,errors=remount-ro)
> > > > proc on /proc type proc (rw,noexec,nosuid,nodev)
> > > > sysfs on /sys type sysfs (rw,noexec,nosuid,nodev)
> > > > none on /sys/fs/cgroup type tmpfs (rw)
> > > > none on /sys/fs/fuse/connections type fusectl (rw)
> > > > none on /sys/kernel/debug type debugfs (rw)
> > > > none on /sys/kernel/security type securityfs (rw)
> > > > udev on /dev type devtmpfs (rw,mode=0755)
> > > > devpts on /dev/pts type devpts (rw,noexec,nosuid,gid=5,mode=0620)
> > > > tmpfs on /run type tmpfs (rw,noexec,nosuid,size=10%,mode=0755)
> > > > none on /run/lock type tmpfs (rw,noexec,nosuid,nodev,size=5242880)
> > > > none on /run/shm type tmpfs (rw,nosuid,nodev)
> > > > none on /run/user type tmpfs (rw,noexec,nosuid,nodev,size=104857600,mode=0755)
> > > > /dev/sda1 on /boot type ext2 (rw)
> > > > /dev/mapper/emsclient--vg-home on /home type ext4 (rw)
> > > >
> > > > Any idea what went wrong here?
> > > >
> > > > Thanks & Regards
> > > > Somnath
> > > >
> > > > -----Original Message-----
> > > > From: Josh Durgin [mailto:josh.durgin@xxxxxxxxxxx]
> > > > Sent: Wednesday, September 18, 2013 6:10 PM
> > > > To: Somnath Roy
> > > > Cc: Sage Weil; ceph-devel@xxxxxxxxxxxxxxx; Anirban Ray;
> > > > ceph-users@xxxxxxxxxxxxxx
> > > > Subject: Re: Scaling RBD module
> > > >
> > > > On 09/17/2013 03:30 PM, Somnath Roy wrote:
> > > >> Hi,
> > > >> I am running Ceph on a 3-node cluster and each of my server nodes
> > > >> is running 10 OSDs, one for each disk. I have one admin node and
> > > >> all the nodes are connected with 2 x 10G networks. One network is
> > > >> for the cluster and the other one is configured as the public
> > > >> network.
> > > >>
> > > >> Here is the status of my cluster.
> > > >>
> > > >> ~/fio_test# ceph -s
> > > >>
> > > >>   cluster b2e0b4db-6342-490e-9c28-0aadf0188023
> > > >>    health HEALTH_WARN clock skew detected on mon. <server-name-2>, mon. <server-name-3>
> > > >>    monmap e1: 3 mons at {<server-name-1>=xxx.xxx.xxx.xxx:6789/0, <server-name-2>=xxx.xxx.xxx.xxx:6789/0, <server-name-3>=xxx.xxx.xxx.xxx:6789/0}, election epoch 64, quorum 0,1,2 <server-name-1>,<server-name-2>,<server-name-3>
> > > >>    osdmap e391: 30 osds: 30 up, 30 in
> > > >>    pgmap v5202: 30912 pgs: 30912 active+clean; 8494 MB data, 27912 MB used, 11145 GB / 11172 GB avail
> > > >>    mdsmap e1: 0/0/1 up
> > > >>
> > > >> I started with the rados bench command to benchmark the read
> > > >> performance of this cluster on a large pool (~10K PGs) and found
> > > >> that each rados client has a limitation. Each client can only
> > > >> drive up to a certain mark. Each server node's CPU utilization
> > > >> shows it is around 85-90% idle, and the admin node (from where the
> > > >> rados client is running) is around ~80-85% idle. I am trying with
> > > >> 4K object size.
> > > >
> > > > Note that rados bench with 4k objects is different from rbd with
> > > > 4k-sized I/Os - rados bench sends each request to a new object,
> > > > while rbd objects are 4M by default.
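
[Editor's note: for reference, a sketch of the kind of rados bench run
under discussion - a 4K-object write phase kept on disk, followed by a
read phase. The pool name and thread count are illustrative, and the
exact flags may vary by release:]

    # write 4K objects for 60 seconds, 32 concurrent ops;
    # keep the objects around so they can be read back
    rados bench -p test_pool 60 write -b 4096 -t 32 --no-cleanup

    # read the objects back for 60 seconds
    rados bench -p test_pool 60 seq -t 32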
> > > >> Now, I started running more clients on the admin node, and the
> > > >> performance scales until it hits the client CPU limit. The server
> > > >> still has 30-35% CPU idle. With small object sizes I must say that
> > > >> the Ceph per-OSD CPU utilization is not promising!
> > > >>
> > > >> After this, I started testing the rados block interface with the
> > > >> kernel rbd module from my admin node.
> > > >> I have created 8 images mapped on the pool having around 10K PGs,
> > > >> and I am not able to scale up the performance by running fio
> > > >> (either by creating a software raid or running on individual
> > > >> /dev/rbd* instances). For example, running multiple fio instances
> > > >> (one on /dev/rbd1 and the other on /dev/rbd2), the performance I
> > > >> am getting is half of what I get running one instance. Here is my
> > > >> fio job script.
> > > >>
> > > >> [random-reads]
> > > >> ioengine=libaio
> > > >> iodepth=32
> > > >> filename=/dev/rbd1
> > > >> rw=randread
> > > >> bs=4k
> > > >> direct=1
> > > >> size=2G
> > > >> numjobs=64
> > > >>
> > > >> Let me know if I am following the proper procedure or not.
> > > >>
> > > >> But, if my understanding is correct, the kernel rbd module is
> > > >> acting as a client to the cluster, and on one admin node I can run
> > > >> only one such kernel instance.
> > > >> If so, I am then limited to the client bottleneck that I stated
> > > >> earlier. The CPU utilization of the server side is around 85-90%
> > > >> idle, so it is clear that the client is not driving the load.
> > > >>
> > > >> My question is, is there any way to hit the cluster with more
> > > >> clients from a single box while testing the rbd module?
> > > >
> > > > You can run multiple librbd instances easily (for example with
> > > > multiple runs of the rbd bench-write command).
> > > >
> > > > The kernel rbd driver uses the same rados client instance for
> > > > multiple block devices by default. There's an option (noshare) to
> > > > use a new rados client instance for a newly mapped device, but it's
> > > > not exposed by the rbd cli. You need to use the sysfs interface
> > > > that 'rbd map' uses instead.
> > > >
> > > > Once you've used rbd map once on a machine, the kernel will already
> > > > have the auth key stored, and you can use:
> > > >
> > > > echo '1.2.3.4:6789 name=admin,key=client.admin,noshare poolname
> > > > imagename' > /sys/bus/rbd/add
> > > >
> > > > Where 1.2.3.4:6789 is the address of a monitor, and you're
> > > > connecting as client.admin.
> > > >
> > > > You can use 'rbd unmap' as usual.
> > > >
> > > > Josh
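
[Editor's note: Josh's librbd suggestion above - multiple runs of rbd
bench-write - might look roughly like this; the pool/image names and
flag values are illustrative:]

    # two independent librbd client instances doing 4K writes in parallel
    rbd bench-write test_rbd/img1 --io-size 4096 --io-threads 16 &
    rbd bench-write test_rbd/img2 --io-size 4096 --io-threads 16 &
    wait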
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com