On 10/20/2014 09:28 AM, Mark Wu wrote:
> 2014-10-20 21:04 GMT+08:00 Mark Nelson <mark.nelson@xxxxxxxxxxx>:
>
>> On 10/20/2014 06:27 AM, Mark Wu wrote:
>>
>>> Test result update:
>>>
>>> Number of Hosts   Maximum single volume IOPS   Maximum aggregated IOPS   SSD Disk IOPS   SSD Disk Utilization
>>> 7                 14k                          45k                       9800+           90%
>>> 8                 21k                          50k                       9800+           90%
>>> 9                 30k                          56k                       9800+           90%
>>> 10                40k                          54k                       8200+           70%
>>>
>>> Note: the average disk request size is about 20 sectors, not the same as the client side (4k).
>>>
>>> I have two questions about the result:
>>>
>>> 1. No matter how many nodes the cluster has, the backend write throughput is always almost 8 times that of the client side. Is this normal behavior in Ceph, or is it caused by some wrong configuration in my setup?
>>
>> Are you counting journal writes and replication into this? Also note that journal writes will be slightly larger and padded to a 4K boundary for each write due to header information. I suspect for coalesced journal writes we may be able to pack the headers together to reduce this overhead.
>
> Yes, the journal writes and replication are counted into the backend writes. Each SSD disk has two partitions: the raw one is used for the journal, and the one formatted as XFS is used for OSD data. The replica setting is 2. So, considering the journal writes and replication, I expect the backend writes to be 4 times the client side. From the perspective of disk utilization it's good, because it's already close to the physical limit, but the overhead is too big. Is it possible to try your idea without modifying code? If yes, I am glad to give it a try.
Sadly it will require code changes and is something we've only briefly talked about.

It is surprising that you would see 8x writes with 2x replication and on-disk journals, imho. One of the things I've done in the past is add up all of the totals for the entire test, both on the client side and on the server side, just to make sure the numbers are right. In past testing things have properly added up, at least on our test rig.
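For illustration, here is a back-of-the-envelope model of where amplification beyond the naive 4x can come from with co-located journals. This is a minimal sketch only: the journal header size below is an assumed, illustrative value, and filesystem metadata on the XFS data partition is ignored.

# Back-of-the-envelope model of per-write amplification with co-located journals.
# Assumptions (illustrative, not measured from this cluster):
#   - every client write is stored on REPLICAS OSDs,
#   - each OSD writes the data once to its journal and once to the data partition,
#   - each journal entry carries a small header and is padded to a 4 KiB boundary.
import math

CLIENT_IO = 4096        # client write size in bytes (4k, as in the vdbench run)
REPLICAS = 2            # replica setting reported above
JOURNAL_ALIGN = 4096    # journal writes padded to a 4K boundary
HEADER = 100            # assumed journal header size in bytes (illustrative only)

def journal_entry_bytes(io_bytes):
    # header added, then padded up to the next 4K boundary
    return math.ceil((io_bytes + HEADER) / JOURNAL_ALIGN) * JOURNAL_ALIGN

def backend_bytes(io_bytes):
    # journal write + data write, on every replica
    return REPLICAS * (journal_entry_bytes(io_bytes) + io_bytes)

total = backend_bytes(CLIENT_IO)
print("%d B client write -> %d B on the backend (%.1fx)" % (CLIENT_IO, total, total / CLIENT_IO))
# -> 4096 B client write -> 24576 B on the backend (6.0x)

Under these assumptions a 4k write already turns into roughly 6x on the SSDs before any filesystem metadata, which may explain part of the gap between the expected 4x and the observed 8x.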
> The following data was captured in the 9-host test. Roughly, the aggregated backend write throughput is 10000 * 22 * 512 * 2 * 9 ≈ 1980 MB/s, while the client side is 56k * 4 KB ≈ 224 MB/s.
>
> Filesystem:  rBlk_nor/s  wBlk_nor/s  rBlk_dir/s  wBlk_dir/s  rBlk_svr/s  wBlk_svr/s  ops/s  rops/s  wops/s
>
> Device:  rrqm/s  wrqm/s   r/s       w/s   rsec/s     wsec/s  avgrq-sz  avgqu-sz  await  svctm  %util
> sda        0.00    0.33  0.00      1.33     0.00      10.67      8.00      0.00   0.00   0.00   0.00
> sdb        0.00    6.00  0.00  10219.67     0.00  223561.67     21.88      4.08   0.40   0.09  89.43
> sdc        0.00    6.00  0.00   9750.67     0.00  220286.67     22.59      2.47   0.25   0.09  89.83
> dm-0       0.00    0.00  0.00      0.00     0.00       0.00      0.00      0.00   0.00   0.00   0.00
> dm-1       0.00    0.00  0.00      1.33     0.00      10.67      8.00      0.00   0.00   0.00   0.00
>
> Filesystem:  rBlk_nor/s  wBlk_nor/s  rBlk_dir/s  wBlk_dir/s  rBlk_svr/s  wBlk_svr/s  ops/s  rops/s  wops/s
>
> Device:  rrqm/s  wrqm/s   r/s       w/s   rsec/s     wsec/s  avgrq-sz  avgqu-sz  await  svctm  %util
> sda        0.00    0.00  0.00      1.00     0.00      26.67     26.67      0.00   0.00   0.00   0.00
> sdb        0.00    6.33  0.00  10389.00     0.00  224668.67     21.63      3.78   0.36   0.09  89.23
> sdc        0.00    4.33  0.00  10106.67     0.00  217986.00     21.57      3.83   0.38   0.09  91.10
> dm-0       0.00    0.00  0.00      0.00     0.00       0.00      0.00      0.00   0.00   0.00   0.00
> dm-1       0.00    0.00  0.00      1.00     0.00      26.67     26.67      0.00   0.00   0.00   0.00
>
>>> 2. For the scalability issue (10 hosts performs worse than 9 hosts), is there any tuning suggestion to improve it?
>>
>> Can you post exactly the test you are running and on how many hosts/volumes? That would help us debug.
>
> In the test, we run vdbench with the following parameters on one host:
>
> sd=sd1,lun=/dev/rbd2,threads=128
> sd=sd2,lun=/dev/rbd0,threads=128
> sd=sd3,lun=/dev/rbd1,threads=128
> *sd=sd4,lun=/dev/rbd3,threads=128
> wd=wd1,sd=sd1,xfersize=4k,rdpct=0,openflags=o_direct
> wd=wd2,sd=sd2,xfersize=4k,rdpct=0,openflags=o_direct
> wd=wd3,sd=sd3,xfersize=4k,rdpct=0,openflags=o_direct
> *wd=wd4,sd=sd4,xfersize=4k,rdpct=0,openflags=o_direct
> rd=run1,wd=wd*,iorate=100000,elapsed=500,interval=1
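A quick cross-check of the ratio implied by those iostat numbers (a minimal sketch using rounded figures from the sdb/sdc lines above and the 9-host row of the results table):

# Rough cross-check: backend vs. client write throughput in the 9-host test,
# using rounded figures from the iostat sample and the results table above.
SECTOR = 512                # iostat reports sectors of 512 bytes
writes_per_sec = 10000      # ~w/s per SSD (sdb/sdc show roughly 9.7k-10.4k)
avg_req_sectors = 22        # avgrq-sz in sectors, i.e. ~11 KB per backend request
disks_per_host = 2
hosts = 9

backend_mb = writes_per_sec * avg_req_sectors * SECTOR * disks_per_host * hosts / 1e6
client_mb = 56000 * 4096 / 1e6      # 56k client IOPS at 4 KiB each

print("backend ~%.0f MB/s, client ~%.0f MB/s, ratio ~%.1fx" % (backend_mb, client_mb, backend_mb / client_mb))
# -> backend ~2028 MB/s, client ~229 MB/s, ratio ~8.8x

So the captured data is consistent with the "almost 8 times" figure quoted earlier rather than with the expected 4x.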
Ok, I don't know a ton about vdbench. Is there any reason you are limiting the iorate to 100000? You might try running the test on multiple clients and seeing if that makes any difference. If you feel like it, it might be worth also running similar tests with something like fio just to verify that the same behaviour is present.
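If it helps, a fio job file roughly equivalent to the vdbench parameter file above might look like the sketch below. It is unverified on this cluster; it keeps 4k random O_DIRECT writes at queue depth 128 per volume, leaves the fourth volume commented out as in the vdbench file, and splits the 100k IOPS cap across the three active jobs.

; rough fio equivalent of the vdbench run above (sketch, not verified)
[global]
ioengine=libaio
direct=1
rw=randwrite
bs=4k
iodepth=128
runtime=500
time_based
group_reporting

[rbd2]
filename=/dev/rbd2
rate_iops=33333

[rbd0]
filename=/dev/rbd0
rate_iops=33333

[rbd1]
filename=/dev/rbd1
rate_iops=33333

; [rbd3]
; filename=/dev/rbd3
; rate_iops=33333

Running it once with and once without the rate_iops lines would also help answer the question above about whether the iorate cap itself is limiting the aggregate number.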
Thanks! Mark
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com