Re: Performance doesn't scale well on a full ssd cluster.

[Re-added the list.]

I assume you added more clients and checked that it didn't scale past
that? You might look through the list archives; there are a number of
discussions about how and how far you can scale SSD-backed cluster
performance.
Just from scanning the config options you set, you might want to bump all
the filestore and journal queue values up a lot further.
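For example, something along these lines in the [osd] section (the numbers
are purely illustrative, not recommendations; you'll want to find the right
values for your hardware by testing):

    filestore_queue_max_ops = 50000
    filestore_queue_max_bytes = 1048576000
    filestore_queue_committing_max_ops = 50000
    filestore_queue_committing_max_bytes = 1048576000
    journal_max_write_entries = 10000
    journal_max_write_bytes = 1048576000
    journal_queue_max_ops = 50000
    journal_queue_max_bytes = 1048576000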
-Greg
Software Engineer #42 @ http://inktank.com | http://ceph.com


On Thu, Oct 16, 2014 at 9:51 AM, Mark Wu <wudx05@xxxxxxxxx> wrote:
> Thanks for the reply. I am not using a single client: writing to 5 rbd
> volumes from 3 hosts is enough to reach the peak. The clients are fio
> processes, also running on the osd nodes, but there is no bottleneck on
> CPU or the network. I also tried running the clients on two non-osd
> servers, with the same result.
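> For reference, each fio job is along these lines (I'm showing the rbd
> ioengine as an example; the pool, volume name, queue depth and runtime
> here are placeholders), with one job per volume run in parallel and the
> iops summed across them:
>
> fio --name=randwrite --ioengine=rbd --clientname=admin --pool=rbd \
>     --rbdname=vol1 --rw=randwrite --bs=4k --iodepth=32 \
>     --runtime=60 --time_based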
>
> On Oct 17, 2014, at 12:29 AM, "Gregory Farnum" <greg@xxxxxxxxxxx> wrote:
>
>> If you're running a single client to drive these tests, that's your
>> bottleneck. Try running multiple clients and aggregating their numbers.
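>> For instance, start "fio --server" on each load-generator host and then
>> kick off the same job against all of them from a single node, so the
>> results end up in one place (the hostnames and job file below are
>> placeholders):
>>
>>   fio --client=host1 randwrite.fio --client=host2 randwrite.fio
>>
>> Or simply run independent fio instances and add up the iops they report.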
>> -Greg
>>
>> On Thursday, October 16, 2014, Mark Wu <wudx05@xxxxxxxxx> wrote:
>>>
>>> Hi list,
>>>
>>> During my testing, I found that Ceph doesn't scale as I expected on a
>>> 30-osd cluster.
>>> The following is the information of my setup:
>>> HW configuration:
>>>    15 Dell R720 servers, and each server has:
>>>       Intel(R) Xeon(R) CPU E5-2680 v2 @ 2.80GHz, 20 cores, with
>>> hyper-threading enabled.
>>>       128GB memory
>>>       two Intel 3500 SSDs, attached to a MegaRAID SAS 2208 controller;
>>> each disk is configured as a separate single-disk RAID0.
>>>       two 10GbE NICs in a bond, used for both the public network and
>>> the cluster network.
>>>
>>> SW configuration:
>>>    OS CentOS 6.5, Kernel 3.17,  Ceph 0.86
>>>    XFS as file system for data.
>>>    each SSD has two partitions: one for osd data and the other for the
>>> osd journal.
>>>    the pool has 2048 pgs and 2 replicas (see the sizing note after the
>>> config dump below).
>>>    5 monitors running on 5 of the 15 servers.
>>>    Ceph configuration (in memory debugging options are disabled)
>>>
>>> [osd]
>>> osd data = /var/lib/ceph/osd/$cluster-$id
>>> osd journal = /var/lib/ceph/osd/$cluster-$id/journal
>>> osd mkfs type = xfs
>>> osd mkfs options xfs = -f -i size=2048
>>> osd mount options xfs = rw,noatime,logbsize=256k,delaylog
>>> osd journal size = 20480
>>> osd mon heartbeat interval = 30
>>> # Performance tuning
>>> osd_max_backfills = 10
>>> osd_recovery_max_active = 15
>>> filestore merge threshold = 40
>>> filestore split multiple = 8
>>> filestore fd cache size = 1024
>>> osd op threads = 64
>>> # Recovery tuning
>>> osd recovery max active = 1
>>> osd max backfills = 1
>>> osd recovery op priority = 1
>>> throttler perf counter = false
>>> osd enable op tracker = false
>>> filestore_queue_max_ops = 5000
>>> filestore_queue_committing_max_ops = 5000
>>> journal_max_write_entries = 1000
>>> journal_queue_max_ops = 5000
>>> objecter_inflight_ops = 8192
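>>>
>>> (A note on the pool sizing above: 2048 pgs is roughly what the common
>>> rule of thumb suggests for this cluster, i.e. 30 osds * 100 / 2 replicas
>>> = 1500, rounded up to the next power of two = 2048.)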
>>>
>>>
>>>   When I test with 7 servers (14 osds), the maximum 4k random write
>>> iops I see is 17k on a single volume and 44k across the whole cluster.
>>> I expected a 30-osd cluster to reach roughly 90k (44k across 14 osds is
>>> about 3.1k iops per osd, so 30 osds should give about 94k). But
>>> unfortunately, with 30 osds the cluster delivers almost the same
>>> performance as with 14 osds, and sometimes worse. I checked the iostat
>>> output on all the nodes, and they show similar numbers: the load is
>>> well distributed, but disk utilization is low. In the test with 14
>>> osds, I saw higher disk utilization (80%~90%).
>>> So do you have any tuning suggestions to improve the performance with
>>> 30 osds?
>>> Any feedback is appreciated.
>>>
>>>
>>> iostat output:
>>> Device:         rrqm/s   wrqm/s     r/s       w/s    rsec/s      wsec/s avgrq-sz avgqu-sz   await  svctm  %util
>>> sda               0.00     0.00    0.00      0.00      0.00        0.00     0.00     0.00    0.00   0.00   0.00
>>> sdb               0.00    88.50    0.00   5188.00      0.00    93397.00    18.00     0.90    0.17   0.09  47.85
>>> sdc               0.00   443.50    0.00   5561.50      0.00    97324.00    17.50     4.06    0.73   0.09  47.90
>>> dm-0              0.00     0.00    0.00      0.00      0.00        0.00     0.00     0.00    0.00   0.00   0.00
>>> dm-1              0.00     0.00    0.00      0.00      0.00        0.00     0.00     0.00    0.00   0.00   0.00
>>>
>>> Device:         rrqm/s   wrqm/s     r/s       w/s    rsec/s      wsec/s avgrq-sz avgqu-sz   await  svctm  %util
>>> sda               0.00    17.50    0.00     28.00      0.00     3948.00   141.00     0.01    0.29   0.05   0.15
>>> sdb               0.00    69.50    0.00   4932.00      0.00    87067.50    17.65     2.27    0.46   0.09  43.45
>>> sdc               0.00    69.00    0.00   4855.50      0.00   105771.50    21.78     0.95    0.20   0.10  46.40
>>> dm-0              0.00     0.00    0.00      0.00      0.00        0.00     0.00     0.00    0.00   0.00   0.00
>>> dm-1              0.00     0.00    0.00     42.50      0.00     3948.00    92.89     0.01    0.19   0.04   0.15
>>>
>>> Device:         rrqm/s   wrqm/s     r/s       w/s    rsec/s      wsec/s avgrq-sz avgqu-sz   await  svctm  %util
>>> sda               0.00    12.00    0.00      8.00      0.00      568.00    71.00     0.00    0.12   0.12   0.10
>>> sdb               0.00    72.50    0.00   5046.50      0.00   113198.50    22.43     1.09    0.22   0.10  51.40
>>> sdc               0.00    72.50    0.00   4912.00      0.00    91204.50    18.57     2.25    0.46   0.09  43.60
>>> dm-0              0.00     0.00    0.00      0.00      0.00        0.00     0.00     0.00    0.00   0.00   0.00
>>> dm-1              0.00     0.00    0.00     18.00      0.00      568.00    31.56     0.00    0.17   0.06   0.10
>>>
>>>
>>>
>>> Regards,
>>> Mark Wu
>>>
>>
>>
>> --
>> Software Engineer #42 @ http://inktank.com | http://ceph.com
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com




