Hi Somnath!

On Tue, 24 Sep 2013, Somnath Roy wrote:
> Hi Sage,
>
> We did quite a few experiments to see how ceph read performance can
> scale up. Here is the summary.
>
> 1.
>
> First we tried to see how far a single node cluster with one osd can
> scale up. We started with cuttlefish release and the entire osd file
> system is on the ssd. What we saw with 4K size object and with single
> rados client with dedicated 10G network, throughput can't go beyond a
> certain point.

Are you using 'rados bench' to generate this load or something else?
We've noticed that individual rados bench commands do not scale beyond a
point but have never looked into it; the problem may be in the bench code
and not in librados or SimpleMessenger.

> We dug through the code and found out SimpleMessenger is opening a
> single socket connection (per client) to talk to the osd. Also, we saw
> there is only one dispatcher Q (Dispatch thread) per SimpleMessenger to
> carry these requests to the OSD. We started adding more dispatcher
> threads in the Dispatch Q and rearranged several locks in Pipe.cc to
> identify the bottleneck. What we ended up discovering is that there is
> a bottleneck both upstream and downstream at the osd level, and
> changing the locking scheme in the io path will affect a lot of other
> code (that we don't even know).
>
> So, we stopped that activity and started to work around the upstream
> bottleneck by introducing more clients to the single OSD. What we saw
> is that a single OSD is scaling, with a lot of cpu utilization. To
> produce ~40K iops (4K) it is taking almost 12 cores of cpu.

Just to make sure I understand: the single OSD dispatch queue does not
become a problem with multiple clients?

Possibilities that come to mind:

- DispatchQueue is doing some funny stuff to keep individual clients'
  messages ordered but to fairly process requests from multiple clients.
  There could easily be a problem with the per-client queue portion of
  this.

- Pipe's use of MSG_MORE is making the TCP stream efficient... you might
  try setting 'ms tcp nodelay = false'.

- The message encode is happening in the thread that sends messages over
  the wire. Maybe doing it in send_message() instead of writer() will
  keep that on a separate core than the thread that's shoveling data into
  the socket.

> Another point, I didn't see this single osd scale with the Dumpling
> release with the multiple clients !! Something changed..

What is it with dumpling?

> 2. After that, we set up a proper cluster with 3 high performing nodes
> and a total of 30 osds. Here also, we are seeing that a single rados
> bench client as well as an rbd client instance is not scaling beyond a
> certain limit. It is not able to generate much load, as node cpu
> utilization remains very low. But running multiple client instances,
> the performance is scaling till it hits the cpu limit.
>
> So, it is pretty clear we are not able to saturate anything with a
> single client, and that's why the 'noshare' option was very helpful to
> measure the rbd performance benchmark. I have single osd/single client
> level callgrind data attached here.

Something from perf that shows a call graph would be more helpful to
identify where things are waiting. We haven't done much optimizing at
this level at all, so these results aren't entirely surprising.

> Now, I am doing the benchmark for radosgw and I think I am stuck with a
> similar bottleneck here. Could you please confirm whether radosgw is
> also opening a single client instance to the cluster?

It is: each radosgw has a single librados client instance.

> If so, is there any similar option like 'noshare' in this case ? Here
> also, creating multiple radosgw instances on separate nodes, the
> performance is scaling.

No, but

> BTW, is there a way to run multiple radosgw on a single node or does it
> have to be one/node ?

Yes. You just need to make sure they have different fastcgi sockets they
listen on and probably set up a separate web server in front of each one.
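To illustrate the "different fastcgi sockets" point, here is a minimal sketch that generates one ceph.conf-style client section per radosgw instance, each with its own socket. The section names (`client.radosgw.gw0`, ...), the socket directory, and the helper names `radosgw_sections` and `render` are hypothetical; `rgw socket path` is the only ceph option used, and each instance would still need its own key and web server front end set up separately.

```python
import configparser
import io


def radosgw_sections(count, socket_dir="/var/run/ceph"):
    """Build one ceph.conf-style section per radosgw instance.

    Section names and socket file names are illustrative assumptions,
    not names taken from the thread.
    """
    conf = configparser.ConfigParser()
    for i in range(count):
        name = f"client.radosgw.gw{i}"
        # Each instance gets a distinct FastCGI socket to listen on.
        conf[name] = {"rgw socket path": f"{socket_dir}/radosgw.gw{i}.sock"}
    return conf


def render(conf):
    """Return the sections as text suitable for pasting into ceph.conf."""
    buf = io.StringIO()
    conf.write(buf)
    return buf.getvalue()


if __name__ == "__main__":
    print(render(radosgw_sections(2)))
```

Running the script prints two sections that differ only in the socket path, which is the property that matters when several gateways share a node.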

I think the next step to understanding what is going on is getting the
right profiling tools in place so we can see where the client threads are
spending their (non-idle and idle) time...

sage

>
> Thanks & Regards
> Somnath
>
> -----Original Message-----
> From: ceph-devel-owner@xxxxxxxxxxxxxxx
> [mailto:ceph-devel-owner@xxxxxxxxxxxxxxx] On Behalf Of Sage Weil
> Sent: Tuesday, September 24, 2013 2:16 PM
> To: Travis Rhoden
> Cc: Josh Durgin; ceph-devel@xxxxxxxxxxxxxxx; Anirban Ray;
> ceph-users@xxxxxxxxxxxxxx
> Subject: Re: [ceph-users] Scaling RBD module
>
> On Tue, 24 Sep 2013, Travis Rhoden wrote:
> > > This "noshare" option may have just helped me a ton -- I sure wish I
> > > would have asked similar questions sooner, because I have seen the
> > > same failure to scale. =)
> > >
> > > One question -- when using the "noshare" option (or really, even
> > > without it) are there any practical limits on the number of RBDs that
> > > can be mounted? I have servers with ~100 RBDs on them each, and am
> > > wondering if I switch them all over to using "noshare" if anything is
> > > going to blow up, use a ton more memory, etc. Even without noshare,
> > > are there any known limits to how many RBDs can be mapped?
>
> With noshare each mapped image will appear as a separate client instance,
> which means it will have its own session with the monitors and own TCP
> connections to the OSDs. It may be a viable workaround for now but in
> general I would not recommend it.
>
> I'm very curious what the scaling issue is with the shared client. Do you
> have a working perf that can capture callgraph information on this machine?
>
> sage
>
> > > Thanks!
> > >
> > > - Travis
> > >
> > > On Thu, Sep 19, 2013 at 8:03 PM, Somnath Roy <Somnath.Roy@xxxxxxxxxxx>
> > > wrote:
> > > Thanks Josh !
> > > I am able to successfully add this noshare option in the image
> > > mapping now.
> > > Looking at dmesg output, I found that was indeed the secret key
> > > problem. Block performance is scaling now.
> > >
> > > Regards
> > > Somnath
> > >
> > > -----Original Message-----
> > > From: ceph-devel-owner@xxxxxxxxxxxxxxx
> > > [mailto:ceph-devel-owner@xxxxxxxxxxxxxxx] On Behalf Of Josh Durgin
> > > Sent: Thursday, September 19, 2013 12:24 PM
> > > To: Somnath Roy
> > > Cc: Sage Weil; ceph-devel@xxxxxxxxxxxxxxx; Anirban Ray;
> > > ceph-users@xxxxxxxxxxxxxx
> > > Subject: Re: [ceph-users] Scaling RBD module
> > >
> > > On 09/19/2013 12:04 PM, Somnath Roy wrote:
> > > > Hi Josh,
> > > > Thanks for the information. I am trying to add the following but
> > > > hitting some permission issue.
> > > >
> > > > root@emsclient:/etc# echo '<mon-1>:6789,<mon-2>:6789,<mon-3>:6789 name=admin,key=client.admin,noshare test_rbd ceph_block_test' > /sys/bus/rbd/add
> > > > -bash: echo: write error: Operation not permitted
> > >
> > > If you check dmesg, it will probably show an error trying to
> > > authenticate to the cluster.
> > >
> > > Instead of key=client.admin, you can pass the base64 secret value as
> > > shown in 'ceph auth list' with the secret=XXXXXXXXXXXXXXXXXXXXX
> > > option.
> > >
> > > BTW, there's a ticket for adding the noshare option to rbd map so
> > > using the sysfs interface like this is never necessary:
> > >
> > > http://tracker.ceph.com/issues/6264
> > >
> > > Josh
> > >
> > > > Here is the contents of rbd directory..
> > > >
> > > > root@emsclient:/sys/bus/rbd# ll
> > > > total 0
> > > > drwxr-xr-x  4 root root    0 Sep 19 11:59 ./
> > > > drwxr-xr-x 30 root root    0 Sep 13 11:41 ../
> > > > --w-------  1 root root 4096 Sep 19 11:59 add
> > > > drwxr-xr-x  2 root root    0 Sep 19 12:03 devices/
> > > > drwxr-xr-x  2 root root    0 Sep 19 12:03 drivers/
> > > > -rw-r--r--  1 root root 4096 Sep 19 12:03 drivers_autoprobe
> > > > --w-------  1 root root 4096 Sep 19 12:03 drivers_probe
> > > > --w-------  1 root root 4096 Sep 19 12:03 remove
> > > > --w-------  1 root root 4096 Sep 19 11:59 uevent
> > > >
> > > > I checked even if I am logged in as root , I can't write anything
> > > > on /sys.
> > > >
> > > > Here is the Ubuntu version I am using..
> > > >
> > > > root@emsclient:/etc# lsb_release -a
> > > > No LSB modules are available.
> > > > Distributor ID: Ubuntu
> > > > Description:    Ubuntu 13.04
> > > > Release:        13.04
> > > > Codename:       raring
> > > >
> > > > Here is the mount information....
> > > >
> > > > root@emsclient:/etc# mount
> > > > /dev/mapper/emsclient--vg-root on / type ext4 (rw,errors=remount-ro)
> > > > proc on /proc type proc (rw,noexec,nosuid,nodev)
> > > > sysfs on /sys type sysfs (rw,noexec,nosuid,nodev)
> > > > none on /sys/fs/cgroup type tmpfs (rw)
> > > > none on /sys/fs/fuse/connections type fusectl (rw)
> > > > none on /sys/kernel/debug type debugfs (rw)
> > > > none on /sys/kernel/security type securityfs (rw)
> > > > udev on /dev type devtmpfs (rw,mode=0755)
> > > > devpts on /dev/pts type devpts (rw,noexec,nosuid,gid=5,mode=0620)
> > > > tmpfs on /run type tmpfs (rw,noexec,nosuid,size=10%,mode=0755)
> > > > none on /run/lock type tmpfs (rw,noexec,nosuid,nodev,size=5242880)
> > > > none on /run/shm type tmpfs (rw,nosuid,nodev)
> > > > none on /run/user type tmpfs (rw,noexec,nosuid,nodev,size=104857600,mode=0755)
> > > > /dev/sda1 on /boot type ext2 (rw)
> > > > /dev/mapper/emsclient--vg-home on /home type ext4 (rw)
> > > >
> > > > Any idea what went wrong here ?
> > > >
> > > > Thanks & Regards
> > > > Somnath
> > > >
> > > > -----Original Message-----
> > > > From: Josh Durgin [mailto:josh.durgin@xxxxxxxxxxx]
> > > > Sent: Wednesday, September 18, 2013 6:10 PM
> > > > To: Somnath Roy
> > > > Cc: Sage Weil; ceph-devel@xxxxxxxxxxxxxxx; Anirban Ray;
> > > > ceph-users@xxxxxxxxxxxxxx
> > > > Subject: Re: [ceph-users] Scaling RBD module
> > > >
> > > > On 09/17/2013 03:30 PM, Somnath Roy wrote:
> > > >> Hi,
> > > >> I am running Ceph on a 3 node cluster and each of my server node
> > > >> is running 10 OSDs, one for each disk. I have one admin node and
> > > >> all the nodes are connected with 2 X 10G network. One network is
> > > >> for cluster and other one configured as public network.
> > > >>
> > > >> Here is the status of my cluster.
> > > >>
> > > >> ~/fio_test# ceph -s
> > > >>
> > > >>   cluster b2e0b4db-6342-490e-9c28-0aadf0188023
> > > >>    health HEALTH_WARN clock skew detected on mon. <server-name-2>, mon. <server-name-3>
> > > >>    monmap e1: 3 mons at {<server-name-1>=xxx.xxx.xxx.xxx:6789/0, <server-name-2>=xxx.xxx.xxx.xxx:6789/0, <server-name-3>=xxx.xxx.xxx.xxx:6789/0}, election epoch 64, quorum 0,1,2 <server-name-1>,<server-name-2>,<server-name-3>
> > > >>    osdmap e391: 30 osds: 30 up, 30 in
> > > >>     pgmap v5202: 30912 pgs: 30912 active+clean; 8494 MB data, 27912 MB used, 11145 GB / 11172 GB avail
> > > >>    mdsmap e1: 0/0/1 up
> > > >>
> > > >> I started with rados bench command to benchmark the read
> > > >> performance of this Cluster on a large pool (~10K PGs) and found
> > > >> that each rados client has a limitation. Each client can only
> > > >> drive up to a certain mark. Each server node cpu utilization
> > > >> shows it is around 85-90% idle and the admin node (from where
> > > >> rados client is running) is around ~80-85% idle. I am trying
> > > >> with 4K object size.
> > > >
> > > > Note that rados bench with 4k objects is different from rbd with
> > > > 4k-sized I/Os - rados bench sends each request to a new object,
> > > > while rbd objects are 4M by default.
> > > >
> > > >> Now, I started running more clients on the admin node and the
> > > >> performance is scaling till it hits the client cpu limit. Server
> > > >> still has the cpu of 30-35% idle. With small object size I must
> > > >> say that the ceph per osd cpu utilization is not promising!
> > > >>
> > > >> After this, I started testing the rados block interface with
> > > >> kernel rbd module from my admin node.
> > > >> I have created 8 images mapped on the pool having around 10K PGs
> > > >> and I am not able to scale up the performance by running fio
> > > >> (either by creating a software raid or running on individual
> > > >> /dev/rbd* instances). For example, running multiple fio
> > > >> instances (one in /dev/rbd1 and the other in /dev/rbd2) the
> > > >> performance I am getting is half of what I am getting if running
> > > >> one instance. Here is my fio job script.
> > > >>
> > > >> [random-reads]
> > > >> ioengine=libaio
> > > >> iodepth=32
> > > >> filename=/dev/rbd1
> > > >> rw=randread
> > > >> bs=4k
> > > >> direct=1
> > > >> size=2G
> > > >> numjobs=64
> > > >>
> > > >> Let me know if I am following the proper procedure or not.
> > > >>
> > > >> But, If my understanding is correct, kernel rbd module is acting
> > > >> as a client to the cluster and in one admin node I can run only
> > > >> one of such kernel instance.
> > > >> If so, I am then limited to the client bottleneck that I stated
> > > >> earlier. The cpu utilization of the server side is around 85-90%
> > > >> idle, so, it is clear that client is not driving.
> > > >>
> > > >> My question is, is there any way to hit the cluster with more
> > > >> client from a single box while testing the rbd module ?
> > > >
> > > > You can run multiple librbd instances easily (for example with
> > > > multiple runs of the rbd bench-write command).
> > > >
> > > > The kernel rbd driver uses the same rados client instance for
> > > > multiple block devices by default. There's an option (noshare) to
> > > > use a new rados client instance for a newly mapped device, but
> > > > it's not exposed by the rbd cli. You need to use the sysfs
> > > > interface that 'rbd map' uses instead.
> > > >
> > > > Once you've used rbd map once on a machine, the kernel will
> > > > already have the auth key stored, and you can use:
> > > >
> > > > echo '1.2.3.4:6789 name=admin,key=client.admin,noshare poolname imagename' > /sys/bus/rbd/add
> > > >
> > > > Where 1.2.3.4:6789 is the address of a monitor, and you're
> > > > connecting as client.admin.
> > > >
> > > > You can use 'rbd unmap' as usual.
> > > >
> > > > Josh
> > > >
> > > > ________________________________
> > > >
> > > > PLEASE NOTE: The information contained in this electronic mail
> > > > message is intended only for the use of the designated
> > > > recipient(s) named above. If the reader of this message is not
> > > > the intended recipient, you are hereby notified that you have
> > > > received this message in error and that any review,
> > > > dissemination, distribution, or copying of this message is
> > > > strictly prohibited. If you have received this communication in
> > > > error, please notify the sender by telephone or e-mail (as shown
> > > > above) immediately and destroy any and all copies of this message
> > > > in your possession (whether hard copies or electronically stored
> > > > copies).
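
For reference, the sysfs mapping line discussed in the thread can be assembled programmatically. The following is a minimal sketch, assuming the line format shown in the quoted messages (monitor addresses, comma-separated options, pool, image); the function names `rbd_add_line` and `map_image` are hypothetical, and actually writing to /sys/bus/rbd/add requires root and the rbd kernel module.

```python
def rbd_add_line(monitors, pool, image, name="admin", secret=None, noshare=True):
    """Format a line for /sys/bus/rbd/add in the form quoted in the thread.

    monitors is a list of "host:port" strings. If secret is given it is
    passed as secret=<base64 value>; otherwise key=client.<name> is used,
    which relies on the kernel already holding that key (for example after
    an earlier 'rbd map' on the same machine).
    """
    if not monitors:
        raise ValueError("at least one monitor address is required")
    auth = f"secret={secret}" if secret else f"key=client.{name}"
    options = [f"name={name}", auth]
    if noshare:
        # noshare requests a new rados client instance for this mapping.
        options.append("noshare")
    return f"{','.join(monitors)} {','.join(options)} {pool} {image}"


def map_image(line, add_path="/sys/bus/rbd/add"):
    """Write the line to the rbd sysfs interface (needs root)."""
    with open(add_path, "w") as f:
        f.write(line)


if __name__ == "__main__":
    # Only prints the line; call map_image() explicitly to map the image.
    print(rbd_add_line(["1.2.3.4:6789"], "poolname", "imagename"))
```

Keeping the string construction separate from the write makes it easy to inspect the line first; if the write fails with "Operation not permitted", dmesg is the place to look for the authentication error, as noted in the thread.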