2014-10-20 21:04 GMT+08:00 Mark Nelson <mark.nelson@xxxxxxxxxxx>:
On 10/20/2014 06:27 AM, Mark Wu wrote:
Test result Update:
Number of Hosts   Maximum single volume IOPS   Maximum aggregated IOPS   SSD Disk IOPS   SSD Disk Utilization
7                 14k                          45k                       9800+           90%
8                 21k                          50k                       9800+           90%
9                 30k                          56k                       9800+           90%
10                40k                          54k                       8200+           70%
Note: the average disk request size is about 20 sectors (~10 KB), not the same as the client-side write size (4 KB).
I have two questions about the result:
1. No matter how many nodes the cluster has, the backend write throughput is always almost 8 times that of the client side. Is this normal behavior in Ceph, or is it caused by a misconfiguration in my setup?
Are you counting journal writes and replication into this? Also note that journal writes will be slightly larger and padded to a 4K boundary for each write due to header information. I suspect for coalesced journal writes we may be able to pack the headers together to reduce this overhead.
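To put rough numbers on that padding effect, here is a minimal sketch; the per-entry header size below is an assumption for illustration, not the actual Ceph journal header layout:

# Rough estimate of journal write inflation: each entry carries a small
# header and is then padded up to the next 4 KiB boundary.
HEADER_BYTES = 40          # assumed per-entry header size (illustrative only)
ALIGN = 4096               # journal entries aligned/padded to 4 KiB

def journal_entry_size(payload_bytes: int) -> int:
    """On-disk size of one journal entry after header and padding."""
    raw = payload_bytes + HEADER_BYTES
    return ((raw + ALIGN - 1) // ALIGN) * ALIGN

client_write = 4096                          # one 4 KiB client write
on_disk = journal_entry_size(client_write)
print(on_disk, on_disk / client_write)       # 8192 2.0

Under these assumptions a 4 KiB client write costs about 8 KiB in the journal, which is why packing the headers of coalesced entries together could reduce the overhead.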
Yes, the journal writes and replication are counted in the backend writes. Each SSD has two partitions: the raw one is used for the journal, and the one formatted as xfs is used for OSD data. The replica setting is 2.
Considering the journal writes and replication, I would expect the backend writes to be 4 times the client side. From the perspective of disk utilization it looks good, because it is already close to the physical limit.
But the overhead is too big. Is it possible to try your idea without modifying the code? If so, I am glad to give it a try.
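To make that expectation explicit, a back-of-the-envelope sketch using the replica count and journal behaviour described above and the 9-host figures quoted below; it is only a rough sanity check:

# Expected backend write amplification: every client write goes to 2
# replicas, and each OSD writes it to its journal and to the data partition.
REPLICAS = 2           # pool replica setting
JOURNAL_FACTOR = 2     # journal write + data write per OSD

expected = REPLICAS * JOURNAL_FACTOR          # = 4x

# Observed in the 9-host run (figures quoted below):
backend_mb_s = 1980
client_mb_s = 224
observed = backend_mb_s / client_mb_s         # ~8.8x

print(expected, round(observed, 1))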
Can you post exactly the test you are running and on how many hosts/volumes? That would help us debug.
The following data was captured in the 9-host test. Roughly, the aggregated backend write throughput is 10000 * 22 * 512 * 2 * 9 ≈ 1980 MB/s (about 10000 writes/s per SSD, ~22 sectors of 512 bytes each, 2 SSDs per host, 9 hosts).
The client side is 56k * 4 KB ≈ 224 MB/s.
Filesystem:  rBlk_nor/s  wBlk_nor/s  rBlk_dir/s  wBlk_dir/s  rBlk_svr/s  wBlk_svr/s  ops/s  rops/s  wops/s

Device:  rrqm/s  wrqm/s   r/s       w/s    rsec/s     wsec/s  avgrq-sz  avgqu-sz  await  svctm  %util
sda        0.00    0.33  0.00      1.33      0.00      10.67      8.00      0.00   0.00   0.00   0.00
sdb        0.00    6.00  0.00  10219.67      0.00  223561.67     21.88      4.08   0.40   0.09  89.43
sdc        0.00    6.00  0.00   9750.67      0.00  220286.67     22.59      2.47   0.25   0.09  89.83
dm-0       0.00    0.00  0.00      0.00      0.00       0.00      0.00      0.00   0.00   0.00   0.00
dm-1       0.00    0.00  0.00      1.33      0.00      10.67      8.00      0.00   0.00   0.00   0.00

Filesystem:  rBlk_nor/s  wBlk_nor/s  rBlk_dir/s  wBlk_dir/s  rBlk_svr/s  wBlk_svr/s  ops/s  rops/s  wops/s

Device:  rrqm/s  wrqm/s   r/s       w/s    rsec/s     wsec/s  avgrq-sz  avgqu-sz  await  svctm  %util
sda        0.00    0.00  0.00      1.00      0.00      26.67     26.67      0.00   0.00   0.00   0.00
sdb        0.00    6.33  0.00  10389.00      0.00  224668.67     21.63      3.78   0.36   0.09  89.23
sdc        0.00    4.33  0.00  10106.67      0.00  217986.00     21.57      3.83   0.38   0.09  91.10
dm-0       0.00    0.00  0.00      0.00      0.00       0.00      0.00      0.00   0.00   0.00   0.00
dm-1       0.00    0.00  0.00      1.00      0.00      26.67     26.67      0.00   0.00   0.00   0.00
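As a sanity check, the same arithmetic over the per-SSD iostat samples above (2 SSDs per host, 9 hosts; the snippet just repeats the calculation, it does not parse live iostat output):

# Aggregate backend write throughput from the first iostat snapshot above.
# wsec/s is in 512-byte sectors; sdb and sdc are the two SSDs on one host.
SECTOR = 512
HOSTS = 9

wsec_per_host = 223561.67 + 220286.67         # sdb + sdc, sectors/s
backend = wsec_per_host * SECTOR * HOSTS      # bytes/s across the cluster
client = 56_000 * 4096                        # 56k IOPS of 4 KiB client writes

print(f"backend ~{backend / 1e6:.0f} MB/s")          # ~2045 MB/s
print(f"client  ~{client / 1e6:.0f} MB/s")            # ~229 MB/s
print(f"amplification ~{backend / client:.1f}x")      # ~8.9x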
2. For the scalability issue (the 10-host run performs worse than the 9-host run), are there any tuning suggestions to improve it?
In the test, we run vdbench with the following parameters on one host:
sd=sd1,lun=/dev/rbd2,threads=128
sd=sd2,lun=/dev/rbd0,threads=128
sd=sd3,lun=/dev/rbd1,threads=128
*sd=sd4,lun=/dev/rbd3,threads=128
wd=wd1,sd=sd1,xfersize=4k,rdpct=0,openflags=o_direct
wd=wd2,sd=sd2,xfersize=4k,rdpct=0,openflags=o_direct
wd=wd3,sd=sd3,xfersize=4k,rdpct=0,openflags=o_direct
*wd=wd4,sd=sd4,xfersize=4k,rdpct=0,openflags=o_direct
rd=run1,wd=wd*,iorate=100000,elapsed=500,interval=1
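For reference, the run definition caps the total rate at iorate=100000 with 4 KiB, write-only, O_DIRECT I/O (and, if the leading '*' is read as a vdbench comment, sd4/wd4 are disabled). A quick sketch of the offered ceiling this implies:

# Offered client-side ceiling implied by the vdbench run definition above.
# iorate=100000 is the total target IOPS; xfersize=4k with rdpct=0 means
# every I/O is a 4 KiB write.
IORATE_CAP = 100_000
XFERSIZE = 4 * 1024

ceiling = IORATE_CAP * XFERSIZE                        # bytes/s at the cap
print(f"offered ceiling ~{ceiling / 1e6:.0f} MB/s")    # ~410 MB/s

# The 9-host run reached ~56k aggregated IOPS (~230 MB/s), so the iorate cap
# was not the bottleneck in that run.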
Thanks!
Mark