Test result update:

Number of hosts   Max single-volume IOPS   Max aggregated IOPS   SSD disk IOPS   SSD disk utilization
7                 14k                      45k                   9800+           90%
8                 21k                      50k                   9800+           90%
9                 30k                      56k                   9800+           90%
10                40k                      54k                   8200+           70%
Note: the average request size on the disks is about 20 sectors, which is not the same as the client-side request size (4k).
I have two questions about the result:
1. No matter how many nodes the cluster has, the backend write throughput is always almost 8 times that of the client side. Is this normal behavior in Ceph, or is it caused by a misconfiguration in my setup?
The following data was captured in the 9-host test. Roughly, the aggregated backend write throughput is 9800 * 22 * 512 * 2 * 9 = about 1980MB/s
The client side is 56k * 4k = 224MB/s
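
The ratio can be checked against every row of the table with a small sketch. It assumes the "9800+" and "8200+" disk IOPS figures are taken as 9800 and 8200, that every row has the roughly 20-sector average request size mentioned in the note, and that each host has two SSDs; the names below are illustrative only.

```python
SECTOR_BYTES = 512
CLIENT_BLOCK_BYTES = 4096      # 4k client writes
AVG_REQUEST_SECTORS = 20       # assumption: "about 20 sectors" for every row
DISKS_PER_HOST = 2             # two SSDs per server

# hosts -> (aggregated client IOPS, per-disk write IOPS), read off the table
RESULTS = {
    7: (45_000, 9_800),
    8: (50_000, 9_800),
    9: (56_000, 9_800),
    10: (54_000, 8_200),
}


def amplification(hosts, client_iops, disk_iops):
    """Backend bytes written per second divided by client bytes written per second."""
    backend = disk_iops * AVG_REQUEST_SECTORS * SECTOR_BYTES * DISKS_PER_HOST * hosts
    client = client_iops * CLIENT_BLOCK_BYTES
    return backend / client


ratios = {h: amplification(h, c, d) for h, (c, d) in RESULTS.items()}
for hosts, ratio in ratios.items():
    print(f"{hosts} hosts: backend/client = {ratio:.2f}x")
```

Under these assumptions every row lands between 7.5x and 8x, which matches the "almost 8 times" observation.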
>> I assume you added more clients and checked that it didn't scale past
>> that?

Yes, correct.

>> You might look through the list archives; there are a number of
>> discussions about how and how far you can scale SSD-backed cluster
>> performance.

I have looked at those discussions before, particularly the one initiated by Sebastien: https://www.mail-archive.com/ceph-users@xxxxxxxxxxxxxx/msg12486.html
From that thread I found that Giant can make better use of an SSD backend. It does improve a lot in the 4k random write test, compared with Firefly.
In the previous tests with Firefly and 16 osds, I found that the iops of 4k random write on a single volume was 14k, which almost reached the peak of the whole cluster.
The iops on each SSD disk was less than 1000, which is far below the hardware limit. It looks like ceph doesn't dispatch fast enough.
With 0.86, the following options, together with disabling debugging, improve performance noticeably:

throttler perf counter = false
osd enable op tracker = false

>> Just scanning through the config options you set, you might want to
>> bump up all the filestore and journal queue values a lot farther.

I tried the following options. They make no difference.

journal_queue_max_ops=3000
objecter_inflight_ops=10240
journal_max_write_bytes=1048576000
journal_queue_max_bytes=1048576000
ms_dispatch_throttle_bytes=1048576000
objecter_inflight_op_bytes=1048576000
filestore_max_sync_interval=10

I have a question about the relationship between the write I/O numbers seen on the ceph client and on the osd disks. From the iostat output pasted in the first message, the writes per second are about 5000 and the average request size is 17~22 sectors. Roughly, the write throughput on all osd nodes is 20 * 512 * 5000 * 30 = 1500MB/s
The replica setting is 2, and the journal and osd data are on the same disk, so can we assume the write on the ssd disks is 40k (fio client result) * 4k * 2 * 2 = 640MB/s in theory?
I don't understand why the actual write is so high compared with the theoretical value.
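
The comparison above can be written out as a short sketch. It assumes a simple model in which each client write is stored once per replica and written twice on each replica (journal plus data), and it uses 4096-byte blocks, so the figures differ slightly from the rounded ones in the text. The function names are illustrative and not part of any Ceph tool.

```python
SECTOR_BYTES = 512


def expected_backend_bytes(client_iops, block_bytes, replicas, journal_copies=2):
    """Model: every client write lands on each replica, once in the journal and once in the data partition."""
    return client_iops * block_bytes * replicas * journal_copies


def observed_backend_bytes(writes_per_sec, avg_sectors, num_disks):
    """Per-disk iostat figures (w/s and avgrq-sz) scaled up to the whole cluster."""
    return writes_per_sec * avg_sectors * SECTOR_BYTES * num_disks


expected = expected_backend_bytes(40_000, 4096, replicas=2)
observed = observed_backend_bytes(5_000, 20, num_disks=30)

print(f"expected: {expected / 1e6:.0f} MB/s")
print(f"observed: {observed / 1e6:.0f} MB/s")
print(f"observed / expected: {observed / expected:.2f}x")
```

With these inputs the observed figure is a bit more than twice the modelled one, which is the gap the question is about.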
And the average request size is also more than twice the client request size. I ran blktrace to check whether the requests are merged by the OS I/O scheduler. From the result, it looks like ceph will merge the requests from the client side into bigger ones if possible.
It also shows that the write on the osds (36,141KiB/s * 30 = 1084MB/s) is much more than the theoretical value (129641KB/s * 4 = 518MB/s).

fio test config and result:

[global]
#logging
#write_iops_log=write_iops_log
#write_bw_log=write_bw_log
#write_lat_log=write_lat_log
ioengine=rbd
clientname=admin
pool=volumes
rbdname=image2
invalidate=0 # mandatory
rw=randwrite
bs=4k

[rbd_iodepth128]
iodepth=128
numjobs=3

Run status group 0 (all jobs):
WRITE: io=3723.5MB, aggrb=129641KB/s, minb=42961KB/s, maxb=43452KB/s, mint=29404msec, maxt=29410msec

Blktrace result:

==================== Device Overhead ====================

       DEV |       Q2G       G2I       Q2M       I2D       D2C
---------- | --------- --------- --------- --------- ---------
  ( 8, 16) |   0.2906%   0.9602%   0.0017%   2.7507%  95.7801%
---------- | --------- --------- --------- --------- ---------
   Overall |   0.2906%   0.9602%   0.0017%   2.7507%  95.7801%

==================== Device Merge Information ====================

       DEV |       #Q       #D   Ratio |   BLKmin   BLKavg   BLKmax    Total
---------- | -------- -------- ------- | -------- -------- -------- --------
  ( 8, 16) |   108683   106834     1.0 |        1       18      560  1924765

Total (sdb):
 Reads Queued:          0,        0KiB   Writes Queued:     108,683,  962,312KiB
 Read Dispatches:       0,        0KiB   Write Dispatches:  106,834,  962,313KiB
 Reads Requeued:        0                Writes Requeued:         0
 Reads Completed:       0,        0KiB   Writes Completed:  106,834,  962,313KiB
 Read Merges:           0,        0KiB   Write Merges:        1,849,    8,176KiB
 IO unplugs:       73,163                Timer unplugs:           0

Throughput (R/W): 0KiB/s / 36,141KiB/s
Events (sdb): 792,897 entries

sdb.btt_qhist.dat (collected on queuing, before merging):

req-size   num
8          59403
16         40522
32         6057
48         1102
64         243
80         60
96         37
112        18
128        8

On Thu, Oct 16, 2014 at 9:51 AM, Mark Wu <wudx05@xxxxxxxxx> wrote:
> Thanks for the reply. I am not using single client.
> Writing 5 rbd volumes on 3 hosts can reach the peak. The client is fio and also running on osd nodes.
> But there're no bottlenecks on cpu or network. I also tried running client
> on two non osd servers, but the same result.
>
> On Oct 17, 2014 at 12:29 AM, "Gregory Farnum" <greg@xxxxxxxxxxx> wrote:
>
>> If you're running a single client to drive these tests, that's your
>> bottleneck. Try running multiple clients and aggregating their numbers.
>> -Greg
>>
>> On Thursday, October 16, 2014, Mark Wu <wudx05@xxxxxxxxx> wrote:
>>>
>>> Hi list,
>>>
>>> During my test, I found ceph doesn't scale as I expected on a 30 osds
>>> cluster.
>>> The following is the information of my setup:
>>> HW configuration:
>>> 15 Dell R720 servers, and each server has:
>>> Intel(R) Xeon(R) CPU E5-2680 v2 @ 2.80GHz, 20 cores and
>>> hyper-thread enabled.
>>> 128GB memory
>>> two Intel 3500 SSD disks, connected with MegaRAID SAS 2208
>>> controller, each disk is configured as raid0 separately.
>>> bonding with two 10GbE nics, used for both the public network and
>>> cluster network.
>>>
>>> SW configuration:
>>> OS CentOS 6.5, Kernel 3.17, Ceph 0.86
>>> XFS as file system for data.
>>> each SSD disk has two partitions, one is osd data and the other is osd
>>> journal.
>>> the pool has 2048 pgs. 2 replicas.
>>> 5 monitors running on 5 of the 15 servers.
>>> Ceph configuration (in memory debugging options are disabled)
>>>
>>> [osd]
>>> osd data = ""
>>> osd journal = /var/lib/ceph/osd/$cluster-$id/journal
>>> osd mkfs type = xfs
>>> osd mkfs options xfs = -f -i size=2048
>>> osd mount options xfs = rw,noatime,logbsize=256k,delaylog
>>> osd journal size = 20480
>>> osd mon heartbeat interval = 30
>>> # Performance tuning filestore
>>> osd_max_backfills = 10
>>> osd_recovery_max_active = 15
>>> merge threshold = 40
>>> filestore split multiple = 8
>>> filestore fd cache size = 1024
>>> osd op threads = 64
>>> # Recovery tuning
>>> osd recovery max active = 1
>>> osd max backfills = 1
>>> osd recovery op priority = 1
>>> throttler perf counter = false
>>> osd enable op tracker = false
>>> filestore_queue_max_ops = 5000
>>> filestore_queue_committing_max_ops = 5000
>>> journal_max_write_entries = 1000
>>> journal_queue_max_ops = 5000
>>> objecter_inflight_ops = 8192
>>>
>>>
>>> When I test with 7 servers (14 osds), the maximum iops of 4k random
>>> write I saw is 17k on single volume and 44k on the whole cluster.
>>> I expected the number of 30 osds cluster could approximate 90k. But
>>> unfortunately, I found that with 30 osds, it provides almost the same
>>> performance as 14 osds, even worse sometimes. I checked the iostat output on all the
>>> nodes, which have similar numbers. It's well distributed but disk
>>> utilization is low.
>>> In the test with 14 osds, I can see higher utilization of disk (80%~90%).
>>> So do you have any tuning suggestions to improve the performance with 30
>>> osds?
>>> Any feedback is appreciated.
>>>
>>>
>>> iostat output:
>>> Device:  rrqm/s  wrqm/s   r/s      w/s  rsec/s     wsec/s avgrq-sz avgqu-sz  await  svctm  %util
>>> sda        0.00    0.00  0.00     0.00    0.00       0.00     0.00     0.00   0.00   0.00   0.00
>>> sdb        0.00   88.50  0.00  5188.00    0.00   93397.00    18.00     0.90   0.17   0.09  47.85
>>> sdc        0.00  443.50  0.00  5561.50    0.00   97324.00    17.50     4.06   0.73   0.09  47.90
>>> dm-0       0.00    0.00  0.00     0.00    0.00       0.00     0.00     0.00   0.00   0.00   0.00
>>> dm-1       0.00    0.00  0.00     0.00    0.00       0.00     0.00     0.00   0.00   0.00   0.00
>>>
>>> Device:  rrqm/s  wrqm/s   r/s      w/s  rsec/s     wsec/s avgrq-sz avgqu-sz  await  svctm  %util
>>> sda        0.00   17.50  0.00    28.00    0.00    3948.00   141.00     0.01   0.29   0.05   0.15
>>> sdb        0.00   69.50  0.00  4932.00    0.00   87067.50    17.65     2.27   0.46   0.09  43.45
>>> sdc        0.00   69.00  0.00  4855.50    0.00  105771.50    21.78     0.95   0.20   0.10  46.40
>>> dm-0       0.00    0.00  0.00     0.00    0.00       0.00     0.00     0.00   0.00   0.00   0.00
>>> dm-1       0.00    0.00  0.00    42.50    0.00    3948.00    92.89     0.01   0.19   0.04   0.15
>>>
>>> Device:  rrqm/s  wrqm/s   r/s      w/s  rsec/s     wsec/s avgrq-sz avgqu-sz  await  svctm  %util
>>> sda        0.00   12.00  0.00     8.00    0.00     568.00    71.00     0.00   0.12   0.12   0.10
>>> sdb        0.00   72.50  0.00  5046.50    0.00  113198.50    22.43     1.09   0.22   0.10  51.40
>>> sdc        0.00   72.50  0.00  4912.00    0.00   91204.50    18.57     2.25   0.46   0.09  43.60
>>> dm-0       0.00    0.00  0.00     0.00    0.00       0.00     0.00     0.00   0.00   0.00   0.00
>>> dm-1       0.00    0.00  0.00    18.00    0.00     568.00    31.56     0.00   0.17   0.06   0.10
>>>
>>>
>>>
>>> Regards,
>>> Mark Wu
>>>
>>
>>
>> --
>> Software Engineer #42 @ http://inktank.com | http://ceph.com