Re: Performance doesn't scale well on a full ssd cluster.

Mark Wu <wudx05@xxxxxxxxx> · Fri, 17 Oct 2014 16:52:44 +0800

>> I assume you added more clients and checked that it didn't scale past
>> that?
Yes, correct. 
>> You might look through the list archives; there are a number of
>> discussions about how and how far you can scale SSD-backed cluster
>> performance.
I have looked at those discussions before, particularly the one initiated by Sebastien:  https://www.mail-archive.com/ceph-users@xxxxxxxxxxxxxx/msg12486.html
From that thread I learned that Giant makes better use of an SSD backend. It does improve 4k random write a lot compared with Firefly.
In my previous tests with Firefly and 16 osds, 4k random write on a single volume reached 14k iops, which is almost the peak of the whole cluster.
The iops on each SSD was less than 1000, far below the hardware limit. It looks like ceph doesn't dispatch fast enough.

With 0.86, setting the following options and disabling debugging gives an obvious improvement:
 throttler perf counter = false
 osd enable op tracker = false

>> Just scanning through the config options you set, you might want to
>> bump up all the filestore and journal queue values a lot farther.

I tried the following options. They made no difference.

journal_queue_max_ops=3000
objecter_inflight_ops=10240	
journal_max_write_bytes=1048576000
journal_queue_max_bytes=1048576000

ms_dispatch_throttle_bytes=1048576000
objecter_inflight_op_bytes=1048576000
filestore_max_sync_interval=10 

I have a question about the relationship between the write I/O issued by the ceph client and the write I/O seen on the osd disks. From the iostat output pasted in the first message,
each disk does about 5000 writes per second and the average request size is 17~22 sectors. Roughly, the write throughput on all osd nodes is 20 * 512 * 5000 * 30 = 1500MB/s.
The replica setting is 2, and the journal and osd data are on the same disk, so can we assume the write rate on the ssd disks is 40k (fio client result) * 4k * 2 * 2 = 640MB/s in theory?
I don't understand why the actual write rate is so high compared with the theoretical value. The average request size is also more than twice the client request size.
I ran blktrace to check whether the requests are merged by the OS I/O scheduler. From the result, it looks like ceph merges the requests from the client side into bigger ones where possible.
It also shows that the write rate on the osds (36,141KiB/s * 30 = 1084MB/s) is much higher than the theoretical value (129641KB/s * 4 = 518MB/s).
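
For anyone who wants to re-check the arithmetic above, here is a minimal sketch in Python. It only uses the figures quoted in this message; the variable names are mine, and it takes 4k as 4096 bytes, so the expected value comes out slightly above the rounded 640MB/s.

```python
SECTOR_BYTES = 512

# Figures quoted in the message (names are illustrative, not from any tool).
avg_request_sectors = 20    # iostat avgrq-sz, roughly 17~22 sectors
writes_per_disk = 5000      # iostat w/s per SSD
num_osds = 30
client_iops = 40_000        # fio client result
client_block_bytes = 4096   # bs=4k, assumed to mean 4096 bytes
replicas = 2
journal_factor = 2          # journal and osd data share the same SSD

# Bytes per second actually hitting the disks, summed over the cluster.
observed = avg_request_sectors * SECTOR_BYTES * writes_per_disk * num_osds

# Bytes per second one would expect from the client workload alone.
expected = client_iops * client_block_bytes * replicas * journal_factor

print(f"observed on disks : {observed / 1e6:.0f} MB/s")
print(f"expected from fio : {expected / 1e6:.0f} MB/s")
print(f"ratio             : {observed / expected:.2f}x")
```

With these inputs the disks see a bit more than twice the bytes that the client workload would account for, which is the gap I am asking about.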

fio test config and result
[global]
#logging
#write_iops_log=write_iops_log
#write_bw_log=write_bw_log
#write_lat_log=write_lat_log
ioengine=rbd
clientname=admin
pool=volumes
rbdname=image2
invalidate=0    # mandatory
rw=randwrite
bs=4k

[rbd_iodepth128]
iodepth=128
numjobs=3

Run status group 0 (all jobs):  
  WRITE: io=3723.5MB, aggrb=129641KB/s, minb=42961KB/s, maxb=43452KB/s, mint=29404msec, maxt=29410msec

Blktrace result:
==================== Device Overhead ====================

       DEV |       Q2G       G2I       Q2M       I2D       D2C
---------- | --------- --------- --------- --------- ---------
 (  8, 16) |   0.2906%   0.9602%   0.0017%   2.7507%  95.7801%
---------- | --------- --------- --------- --------- ---------
   Overall |   0.2906%   0.9602%   0.0017%   2.7507%  95.7801%

==================== Device Merge Information ====================

       DEV |       #Q       #D   Ratio |   BLKmin   BLKavg   BLKmax    Total
---------- | -------- -------- ------- | -------- -------- -------- --------
 (  8, 16) |   108683   106834     1.0 |        1       18      560  1924765

Total (sdb):
 Reads Queued:           0,        0KiB  Writes Queued:     108,683,  962,312KiB
 Read Dispatches:        0,        0KiB  Write Dispatches:  106,834,  962,313KiB
 Reads Requeued:         0               Writes Requeued:         0
 Reads Completed:        0,        0KiB  Writes Completed:  106,834,  962,313KiB
 Read Merges:            0,        0KiB  Write Merges:        1,849,    8,176KiB
 IO unplugs:        73,163               Timer unplugs:           0

Throughput (R/W): 0KiB/s / 36,141KiB/s
Events (sdb): 792,897 entries

sdb.btt_qhist.dat:  ( collected on queuing, before merging)
req-size num
   8   59403
  16   40522
  32   6057
  48   1102
  64   243
  80   60
  96   37
  112  18
  128  8

On Thu, Oct 16, 2014 at 9:51 AM, Mark Wu <wudx05@xxxxxxxxx> wrote:
> Thanks for the reply. I am not using a single client. Writing to 5 rbd
> volumes on 3 hosts can reach the peak. The client is fio, also running on
> the osd nodes. But there are no bottlenecks on cpu or network. I also tried
> running the client on two non-osd servers, with the same result.
>
> On Oct 17, 2014 at 12:29 AM, "Gregory Farnum" <greg@xxxxxxxxxxx> wrote:
>
>> If you're running a single client to drive these tests, that's your
>> bottleneck. Try running multiple clients and aggregating their numbers.
>> -Greg
>>
>> On Thursday, October 16, 2014, Mark Wu <wudx05@xxxxxxxxx> wrote:
>>>
>>> Hi list,
>>>
>>> During my test, I found ceph doesn't scale as I expected on a 30-osd
>>> cluster.
>>> The following is the information of my setup:
>>> HW configuration:
>>>    15 Dell R720 servers, and each server has:
>>>       Intel(R) Xeon(R) CPU E5-2680 v2 @ 2.80GHz, 20 cores and
>>> hyper-thread enabled.
>>>       128GB memory
>>>       two Intel 3500 SSD disks, connected with MegaRAID SAS 2208
>>> controller, each disk is configured as raid0 separately.
>>>       bonding with two 10GbE nics, used for both the public network and
>>> cluster network.
>>>
>>> SW configuration:
>>>    OS CentOS 6.5, Kernel 3.17,  Ceph 0.86
>>>    XFS as file system for data.
>>>    each SSD disk has two partitions, one is osd data and the other is osd
>>> journal.
>>>    the pool has 2048 pgs. 2 replicas.
>>>    5 monitors running on 5 of the 15 servers.
>>>    Ceph configuration (in memory debugging options are disabled)
>>>
>>> [osd]
>>> osd data = ""
>>> osd journal = /var/lib/ceph/osd/$cluster-$id/journal
>>> osd mkfs type = xfs
>>> osd mkfs options xfs = -f -i size=2048
>>> osd mount options xfs = rw,noatime,logbsize=256k,delaylog
>>> osd journal size = 20480
>>> osd mon heartbeat interval = 30
>>> # Performance tuning
>>> osd_max_backfills = 10
>>> osd_recovery_max_active = 15
>>> filestore merge threshold = 40
>>> filestore split multiple = 8
>>> filestore fd cache size = 1024
>>> osd op threads = 64
>>> # Recovery tuning
>>> osd recovery max active = 1
>>> osd max backfills = 1
>>> osd recovery op priority = 1
>>> throttler perf counter = false
>>> osd enable op tracker = false
>>> filestore_queue_max_ops = 5000
>>> filestore_queue_committing_max_ops = 5000
>>> journal_max_write_entries = 1000
>>> journal_queue_max_ops = 5000
>>> objecter_inflight_ops = 8192
>>>
>>>
>>>   When I test with 7 servers (14 osds), the maximum iops of 4k random
>>> write I saw is 17k on a single volume and 44k on the whole cluster.
>>> I expected a 30-osd cluster to approach 90k. But unfortunately, I found
>>> that with 30 osds it provides almost the same performance as 14 osds,
>>> sometimes even worse. I checked the iostat output on all the nodes, which
>>> show similar numbers. The load is well distributed but disk utilization
>>> is low.
>>> In the test with 14 osds, I can see higher disk utilization (80%~90%).
>>> So do you have any tuning suggestions to improve the performance with 30
>>> osds?
>>> Any feedback is appreciated.
>>>
>>>
>>> iostat output:
>>> Device:  rrqm/s  wrqm/s   r/s      w/s  rsec/s     wsec/s  avgrq-sz  avgqu-sz  await  svctm  %util
>>> sda        0.00    0.00  0.00     0.00    0.00       0.00      0.00      0.00   0.00   0.00   0.00
>>> sdb        0.00   88.50  0.00  5188.00    0.00   93397.00     18.00      0.90   0.17   0.09  47.85
>>> sdc        0.00  443.50  0.00  5561.50    0.00   97324.00     17.50      4.06   0.73   0.09  47.90
>>> dm-0       0.00    0.00  0.00     0.00    0.00       0.00      0.00      0.00   0.00   0.00   0.00
>>> dm-1       0.00    0.00  0.00     0.00    0.00       0.00      0.00      0.00   0.00   0.00   0.00
>>>
>>> Device:  rrqm/s  wrqm/s   r/s      w/s  rsec/s     wsec/s  avgrq-sz  avgqu-sz  await  svctm  %util
>>> sda        0.00   17.50  0.00    28.00    0.00    3948.00    141.00      0.01   0.29   0.05   0.15
>>> sdb        0.00   69.50  0.00  4932.00    0.00   87067.50     17.65      2.27   0.46   0.09  43.45
>>> sdc        0.00   69.00  0.00  4855.50    0.00  105771.50     21.78      0.95   0.20   0.10  46.40
>>> dm-0       0.00    0.00  0.00     0.00    0.00       0.00      0.00      0.00   0.00   0.00   0.00
>>> dm-1       0.00    0.00  0.00    42.50    0.00    3948.00     92.89      0.01   0.19   0.04   0.15
>>>
>>> Device:  rrqm/s  wrqm/s   r/s      w/s  rsec/s     wsec/s  avgrq-sz  avgqu-sz  await  svctm  %util
>>> sda        0.00   12.00  0.00     8.00    0.00     568.00     71.00      0.00   0.12   0.12   0.10
>>> sdb        0.00   72.50  0.00  5046.50    0.00  113198.50     22.43      1.09   0.22   0.10  51.40
>>> sdc        0.00   72.50  0.00  4912.00    0.00   91204.50     18.57      2.25   0.46   0.09  43.60
>>> dm-0       0.00    0.00  0.00     0.00    0.00       0.00      0.00      0.00   0.00   0.00   0.00
>>> dm-1       0.00    0.00  0.00    18.00    0.00     568.00     31.56      0.00   0.17   0.06   0.10
>>>
>>>
>>>
>>> Regards,
>>> Mark Wu
>>>
>>
>>
>> --
>> Software Engineer #42 @ http://inktank.com | http://ceph.com

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com