Re: Performance doesn't scale well on a full ssd cluster.

Hello (Greg in particular),

On Thu, 16 Oct 2014 10:06:58 -0700 Gregory Farnum wrote:

> [Re-added the list.]
> 
> I assume you added more clients and checked that it didn't scale past
> that? You might look through the list archives; there are a number of
> discussions about how and how far you can scale SSD-backed cluster
> performance.

Indeed there are, and the first one I remember (not SSD-backed, but close
enough) is by yours truly:
https://www.mail-archive.com/ceph-users@xxxxxxxxxxxxxx/msg09537.html

In which you participated as well.

> Just scanning through the config options you set, you might want to
> bump up all the filestore and journal queue values a lot farther.
>
I did that back then, with little to no effect.
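
For anyone following along, "bumping the queue values" means raising the
filestore and journal throttles in ceph.conf, roughly along these lines
(the numbers below are purely illustrative, not a recommendation):

[osd]
# illustrative values only, the defaults are far lower
filestore queue max ops = 10000
filestore queue max bytes = 1048576000
filestore queue committing max ops = 10000
filestore queue committing max bytes = 1048576000
journal queue max ops = 50000
journal queue max bytes = 10485760000
journal max write entries = 10000
journal max write bytes = 1048576000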

Which brings me to another point: only a fraction of these parameters
(visible when doing a live config dump) are documented. One can guess what
they probably do and what their values denote, but that is not how it
should be, especially when you expect people to tune them.
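
For reference, the live dump I mean is the one from the admin socket, e.g.
(osd.0 purely as an example, run on the node hosting that OSD):

    ceph daemon osd.0 config show
    # or directly via the admin socket:
    ceph --admin-daemon /var/run/ceph/ceph-osd.0.asok config show

That prints many hundreds of parameters, most of which never show up in
the documentation.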

Christian

> -Greg
> Software Engineer #42 @ http://inktank.com | http://ceph.com
> 
> 
> On Thu, Oct 16, 2014 at 9:51 AM, Mark Wu <wudx05@xxxxxxxxx> wrote:
> > Thanks for the reply. I am not using a single client. Writing to 5 rbd
> > volumes from 3 hosts reaches the peak. The client is fio, also running
> > on the osd nodes, but there are no bottlenecks on CPU or network. I
> > also tried running the clients on two non-osd servers, with the same
> > result.
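> >
> > For reference, each volume was driven by a job roughly like the one
> > below (just a sketch: the rbd engine, pool and image names here are
> > placeholders/assumptions, not my exact job file):
> >
> > [global]
> > ioengine=rbd
> > clientname=admin
> > pool=rbd
> > rbdname=test-vol1
> > rw=randwrite
> > bs=4k
> > iodepth=32
> > direct=1
> > runtime=60
> > time_based
> >
> > [rbd-4k-randwrite]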
> >
> > On Oct 17, 2014, 12:29 AM, "Gregory Farnum" <greg@xxxxxxxxxxx> wrote:
> >
> >> If you're running a single client to drive these tests, that's your
> >> bottleneck. Try running multiple clients and aggregating their
> >> numbers. -Greg
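> >>
> >> A sketch of what I mean, using fio's client/server mode (hostnames
> >> and the job file name are placeholders):
> >>
> >> # on each client node
> >> fio --server
> >> # on one coordinating node; fio reports per-client results plus a
> >> # combined all-clients summary
> >> fio --client=node1 --client=node2 --client=node3 randwrite-4k.fio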
> >>
> >> On Thursday, October 16, 2014, Mark Wu <wudx05@xxxxxxxxx> wrote:
> >>>
> >>> Hi list,
> >>>
> >>> During my tests, I found that Ceph doesn't scale as I expected on a
> >>> 30-osd cluster.
> >>> The following is the information about my setup:
> >>> HW configuration:
> >>>    15 Dell R720 servers, each with:
> >>>       Intel(R) Xeon(R) CPU E5-2680 v2 @ 2.80GHz, 20 cores,
> >>> hyper-threading enabled
> >>>       128GB memory
> >>>       two Intel 3500 SSDs behind a MegaRAID SAS 2208 controller,
> >>> each disk configured as a separate RAID0
> >>>       two 10GbE NICs in a bond, used for both the public and the
> >>> cluster network
> >>>
> >>> SW configuration:
> >>>    OS: CentOS 6.5, kernel 3.17, Ceph 0.86
> >>>    XFS as the file system for data
> >>>    each SSD has two partitions, one for osd data and the other for
> >>> the osd journal
> >>>    the pool has 2048 pgs and 2 replicas
> >>>    5 monitors, running on 5 of the 15 servers
> >>>    Ceph configuration (in-memory debugging options are disabled):
> >>>
> >>> [osd]
> >>> osd data = /var/lib/ceph/osd/$cluster-$id
> >>> osd journal = /var/lib/ceph/osd/$cluster-$id/journal
> >>> osd mkfs type = xfs
> >>> osd mkfs options xfs = -f -i size=2048
> >>> osd mount options xfs = rw,noatime,logbsize=256k,delaylog
> >>> osd journal size = 20480
> >>> osd mon heartbeat interval = 30
> >>> # Performance tuning
> >>> osd_max_backfills = 10
> >>> osd_recovery_max_active = 15
> >>> filestore merge threshold = 40
> >>> filestore split multiple = 8
> >>> filestore fd cache size = 1024
> >>> osd op threads = 64
> >>> # Recovery tuning
> >>> osd recovery max active = 1
> >>> osd max backfills = 1
> >>> osd recovery op priority = 1
> >>> throttler perf counter = false
> >>> osd enable op tracker = false
> >>> filestore_queue_max_ops = 5000
> >>> filestore_queue_committing_max_ops = 5000
> >>> journal_max_write_entries = 1000
> >>> journal_queue_max_ops = 5000
> >>> objecter_inflight_ops = 8192
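> >>>
> >>> (The in-memory debug settings mentioned above are not listed here;
> >>> disabling them looks roughly like the following, shown only to
> >>> illustrate what "disabled" means, not my exact list:)
> >>>
> >>> [global]
> >>> debug ms = 0/0
> >>> debug osd = 0/0
> >>> debug filestore = 0/0
> >>> debug journal = 0/0
> >>> debug monc = 0/0
> >>> debug auth = 0/0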
> >>>
> >>>
> >>>   When I test with 7 servers (14 osds), the maximum IOPS of 4k
> >>> random writes I see is 17k on a single volume and 44k on the whole
> >>> cluster. I expected a 30-osd cluster to approach 90k (roughly linear
> >>> scaling from 44k on 14 osds: 44k * 30/14 ~= 94k). But unfortunately,
> >>> I found that with 30 osds it delivers almost the same performance as
> >>> 14 osds, sometimes even worse. I checked the iostat output on all
> >>> the nodes; they show similar numbers. The load is well distributed,
> >>> but disk utilization is low.
> >>> In the test with 14 osds I see higher disk utilization (80%~90%).
> >>> So do you have any tuning suggestions to improve the performance
> >>> with 30 osds?
> >>> Any feedback is appreciated.
> >>>
> >>>
> >>> iostat output:
> >>> Device:         rrqm/s   wrqm/s     r/s     w/s   rsec/s    wsec/s avgrq-sz avgqu-sz   await  svctm  %util
> >>> sda               0.00     0.00    0.00    0.00     0.00      0.00     0.00     0.00    0.00   0.00   0.00
> >>> sdb               0.00    88.50    0.00 5188.00     0.00  93397.00    18.00     0.90    0.17   0.09  47.85
> >>> sdc               0.00   443.50    0.00 5561.50     0.00  97324.00    17.50     4.06    0.73   0.09  47.90
> >>> dm-0              0.00     0.00    0.00    0.00     0.00      0.00     0.00     0.00    0.00   0.00   0.00
> >>> dm-1              0.00     0.00    0.00    0.00     0.00      0.00     0.00     0.00    0.00   0.00   0.00
> >>>
> >>> Device:         rrqm/s   wrqm/s     r/s     w/s   rsec/s    wsec/s avgrq-sz avgqu-sz   await  svctm  %util
> >>> sda               0.00    17.50    0.00   28.00     0.00   3948.00   141.00     0.01    0.29   0.05   0.15
> >>> sdb               0.00    69.50    0.00 4932.00     0.00  87067.50    17.65     2.27    0.46   0.09  43.45
> >>> sdc               0.00    69.00    0.00 4855.50     0.00 105771.50    21.78     0.95    0.20   0.10  46.40
> >>> dm-0              0.00     0.00    0.00    0.00     0.00      0.00     0.00     0.00    0.00   0.00   0.00
> >>> dm-1              0.00     0.00    0.00   42.50     0.00   3948.00    92.89     0.01    0.19   0.04   0.15
> >>>
> >>> Device:         rrqm/s   wrqm/s     r/s     w/s   rsec/s    wsec/s avgrq-sz avgqu-sz   await  svctm  %util
> >>> sda               0.00    12.00    0.00    8.00     0.00    568.00    71.00     0.00    0.12   0.12   0.10
> >>> sdb               0.00    72.50    0.00 5046.50     0.00 113198.50    22.43     1.09    0.22   0.10  51.40
> >>> sdc               0.00    72.50    0.00 4912.00     0.00  91204.50    18.57     2.25    0.46   0.09  43.60
> >>> dm-0              0.00     0.00    0.00    0.00     0.00      0.00     0.00     0.00    0.00   0.00   0.00
> >>> dm-1              0.00     0.00    0.00   18.00     0.00    568.00    31.56     0.00    0.17   0.06   0.10
> >>>
> >>>
> >>>
> >>> Regards,
> >>> Mark Wu
> >>>
> >>
> >>
> >> --
> >> Software Engineer #42 @ http://inktank.com | http://ceph.com


-- 
Christian Balzer        Network/Systems Engineer                
chibi@xxxxxxx   	Global OnLine Japan/Fusion Communications
http://www.gol.com/
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com




