Re: Performance test on Ceph cluster


 



Tommi Virtanen <tommi.virtanen <at> dreamhost.com> writes:

> 
> On Wed, Feb 22, 2012 at 23:12, madhusudhana
> <madhusudhana.u.acharya <at> gmail.com> wrote:
> > 1. can you please let me know how I can make only 1 MDS active ?
> 
> You can see that in "ceph -s" output, the "mds" line should have just
> one entry like "0=a=up:active" with the word active.
> 
> You can control that with the "max mds" config option, and at runtime
> with "ceph mds set_max_mds NUM" and "ceph mds stop ID".
> 
> Note, decreasing the number of active MDSes is not currently well
> tested. You might be better off with a fresh cluster, that only ever
> ran one ceph-mds process.
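
(If I follow this correctly, going down to a single active MDS would be
something like the commands below; the "1" is just a placeholder for
whichever MDS rank is being retired.)

        ceph mds set_max_mds 1     # allow only one active MDS
        ceph mds stop 1            # stop rank 1, leaving rank 0 active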
> 
> > 2. BTRFS for all OSD's
> 
> There is currently one known case where btrfs's internal structures
> get fragmented and its performance starts degrading. You might want
> to make sure you start your test with freshly mkfs'ed btrfs filesystems.
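
(For reference, re-creating one OSD's filesystem from scratch would be
something along these lines; the device and mount point below are
made-up placeholders, not my actual layout.)

        mkfs.btrfs /dev/sdb1                    # fresh btrfs for the OSD data
        mount -t btrfs /dev/sdb1 /srv/osd.0     # remount before restarting ceph-osd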
> 
> > 3. All hosts (including OSD) in my ceph cluster are running 3.0.9 ver
> >                [root <at> ceph-node-8 ~]# uname -r
> >                3.0.9
> 
> Well, that's at least in the 3.x series. Btrfs has had a steady
> stream of fixes, so in general we recommend running the latest stable
> kernel. You might want to try that.
> 
> > 4. All 9 machines are replica of each other. I have imaged them using
> > systemimager. Only difference is 9th node is not a part of CEPH
> > cluster. I mounted ceph cluster to this node using mount -t ceph
> > command
> 
> That's good.
> 
> > 5. All 9 clients are running same version of CentOS and Kernel with
> > 1GigE interface
> 
> > You mean to say, I can have ceph mon/OSDs running on the
> > same machine? But in the ceph wiki, I have read that it's better to
> > have different machines for each mds/mon/osd.
> 
> Yes, I just wanted to make sure you have it set up like that.
> 
> > I assume that ceph uses whatever ethernet interface I have (1GigE)
> > in my system to load-balance the cluster in case of node failure or
> > node addition. Won't this use the entire bandwidth during load
> > balancing? Won't this cause bandwidth saturation for the clients?
> 
> Yes. That's why you can set up a separate network for cluster-internal
> communication. See "cluster network" or "cluster addr" vs "public
> network" or "public addr".
> 
> > I would like to know what benchmark I should use to test CEPH ?
> > I want to present the data to my management how CEPH can perform when
> > compared with other file systems (like GlusterFS/NetApp/Lustre)
> 
> You should use the benchmark that matches your actual workload best.
> 
> Please stay active on the mailing list until your results start
> looking good. The more information you can provide, the better we can
> help you.
> 
> We're looking forward to getting one of our new hires going; he'll be
> benchmarking Ceph on pretty decent hardware & 10gig network with
> whatever loads we can come up with. That should give you a better idea
> of what to expect, and us what to keep working on.
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo <at> vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 
> 
Thank you, Tommi, for your response.

1. In my cluster, all OSDs are mkfs'ed with btrfs.
2. Below is what I see in the "ceph -s" output. Does this mean that only
one MDS is active and the other one is standby?
          mds e5: 1/1/1 up {0=ceph-node-1=up:active}, 1 up:standby
3. I will not be able to use a newer stable kernel because of company
policy :-(
4. If you don't mind, can you please give me a bit of insight into the
cluster network: what it is, and how I can configure one for my Ceph
cluster? Will there be a significant performance improvement with this?
5. I have done some testing with dd on Ceph. Below are the results:

CASE 1: [root@ceph-node-9 ~]# dd if=/dev/zero of=/mnt/ceph-test/wtest bs=4k count=1000000
1000000+0 records in
1000000+0 records out
4096000000 bytes (4.1 GB) copied, 4.04089 seconds, 1.0 GB/s

CASE 2: [root@ceph-node-9 ~]# dd if=/dev/zero of=/mnt/ceph-test/wtest bs=4k count=10000000
10000000+0 records in
10000000+0 records out
40960000000 bytes (41 GB) copied, 445.786 seconds, 91.9 MB/s

CASE 3: [root@ceph-node-9 ~]# dd if=/dev/zero of=/mnt/ceph-test/wtest bs=4k count=100000000
71414032+0 records in
71414032+0 records out
292511875072 bytes (293 GB) copied, 4116.59 seconds, 71.1 MB/s

As you can see from the output above, for a 4 GB file written in 4k
blocks the speed clocked in at 1.0 GB/s, and it gradually decreased once
I increased the file size beyond 10 GB. Also, if I run back-to-back dd
runs with the CASE 1 options, the write speed drops from 1 GB/s to about
90 MB/s.

Can you please explain whether this behaviour is expected? If yes, why?
If not, how can I achieve 1 GB/s for all file sizes?
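
For completeness, here is the CASE 1 command with conv=fdatasync added,
which makes dd flush the file to stable storage before it reports a
speed; I can re-run the tests this way if that gives more meaningful
numbers:

        dd if=/dev/zero of=/mnt/ceph-test/wtest bs=4k count=1000000 conv=fdatasync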






