Hello,

in a nutshell, I can confirm the write amplification; see inline.

On Mon, 20 Oct 2014 10:43:51 -0500 Mark Nelson wrote:

> On 10/20/2014 09:28 AM, Mark Wu wrote:
> >
> > 2014-10-20 21:04 GMT+08:00 Mark Nelson <mark.nelson@xxxxxxxxxxx
> > <mailto:mark.nelson@xxxxxxxxxxx>>:
> >
> > > On 10/20/2014 06:27 AM, Mark Wu wrote:
> > >
> > > > Test result Update:
> > > >
> > > > Number of Hosts   Maximum single volume IOPS   Maximum aggregated IOPS   SSD Disk IOPS   SSD Disk Utilization
> > > > 7                 14k                           45k                       9800+           90%
> > > > 8                 21k                           50k                       9800+           90%
> > > > 9                 30k                           56k                       9800+           90%
> > > > 10                40k                           54k                       8200+           70%
> > > >
> > > > Note: the disk average request size is about 20 sectors, not the
> > > > same as the client side (4k).
> > > >
> > > > I have two questions about the result:
> > > >
> > > > 1. No matter how many nodes the cluster has, the backend write
> > > > throughput is always almost 8 times that of the client side. Is
> > > > this normal behavior in Ceph, or is it caused by some wrong
> > > > configuration in my setup?
> > >
> > > Are you counting journal writes and replication into this? Also
> > > note that journal writes will be slightly larger and padded to a 4K
> > > boundary for each write due to header information. I suspect for
> > > coalesced journal writes we may be able to pack the headers
> > > together to reduce this overhead.
> >
> > Yes, the journal writes and replication are counted into the backend
> > writes. Each SSD has two partitions: the raw one is used for the
> > journal and the one formatted as XFS is used for OSD data. The
> > replica setting is 2.
> > So considering the journal writes and replication, I expect the
> > writes on the backend to be 4 times the client side. From the
> > perspective of disk utilization it's good, because it's already
> > close to the physical limitation.
> > But the overhead is too big. Is it possible to try your idea without
> > modifying code? If yes, I am glad to give it a try.
>
> Sadly it will require code changes and is something we've only briefly
> talked about. So it is surprising that you would see 8x writes with 2x
> replication and on-disk journals imho. In the past one of the things
> I've done is add up all of the totals for the entire test, both on the
> client side and on the server side, just to make sure that the numbers
> are right. At least in past testing things properly added up, at least
> on our test rig.

Using rbd bench with the default (4MB) block size it adds up pretty
well, even better than expected, on my temporary test machine (single
storage server, no replication, 8 DC S3700 100GB for OSDs, journal on a
partition of the same device, IB 4xQDR link to the client, Ceph 0.80.7).

During that test I see about 1.5GB/s disk I/O with atop, which matches
the result beautifully, as in:

  Total time run:         30.205574
  Total writes made:      5393
  Write size:             4194304
  Bandwidth (MB/sec):     714.173

That makes about 21GB written, or about 2.7GB per OSD. According to the
SSD SMART values they wrote about 4.6GB each during that test, so better
than the 2x amplification expected (data plus journal on the same
device, no replication).

As a side note, while the 8 3.2GHz cores get moderately busy (50%ish)
during this test, the limiting factor is the SSDs at their 200MB/s
maximum write speed.

Now if we run that same test with a block size of 4KB, one gets:

  Total time run:         30.033196
  Total writes made:      126944
  Write size:             4096
  Bandwidth (MB/sec):     16.511

This makes about 508MB written, or roughly 64MB per OSD. According to
the SMART values of the SSDs they wrote 768MB each, or in other words 6
times more than one would have expected with a write amplification of 2.
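To make that arithmetic easy to redo after longer runs, here is a rough
back-of-the-envelope sketch (my own helper, not part of Ceph or rbd);
the inputs are simply the rbd bench totals above plus the per-SSD
host-write deltas read from SMART before and after each run:

  #!/usr/bin/env python
  # Rough write-amplification check for the two runs above.
  # A sketch only, not part of Ceph or rbd.

  MiB = 2.0 ** 20
  GiB = 2.0 ** 30

  def amplification(total_writes, write_size, num_osds, smart_bytes_per_ssd,
                    expected=2.0):
      # expected=2.0: journal and data share the same SSD and there is
      # no replication on this single-node test box.
      client_bytes = total_writes * write_size      # data sent by the client
      per_osd = client_bytes / float(num_osds)      # client data per OSD
      observed = smart_bytes_per_ssd / per_osd      # what each SSD actually wrote
      return observed, observed / expected

  # 4MB blocks: 5393 writes of 4194304 bytes, roughly 4.6GB per SSD (SMART delta)
  print("4MB run: %.1fx observed, %.1fx of the expected 2x" %
        amplification(5393, 4194304, 8, 4.6 * GiB))
  # 4KB blocks: 126944 writes of 4096 bytes, roughly 768MB per SSD (SMART delta)
  print("4KB run: %.1fx observed, %.1fx of the expected 2x" %
        amplification(126944, 4096, 8, 768 * MiB))

For the 4KB case that comes out at roughly 12x on the SSDs, i.e. about
6 times the 2x one would expect, which is exactly the discrepancy above.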
Not only does this obviously point at another vast reservoir for
improvements, it also means that your SSDs will be whittled down to
size at a rate that's both totally unexpected and unacceptable.

And the side note here, unsurprisingly, is that during this test all 8
cores go into meltdown while the SSDs are bored (about 20% utilization),
for a whopping 4000 IOPS.

Here's hoping for Giant and Hammer...

Christian

> > The following data is captured in the 9-host test. Roughly, the
> > aggregated backend write throughput is 10000 * 22 * 512 * 2 * 9 =
> > 1980MB/s.
> > The client side is 56k * 4k = 224MB/s.
> >
> > Filesystem:  rBlk_nor/s  wBlk_nor/s  rBlk_dir/s  wBlk_dir/s  rBlk_svr/s  wBlk_svr/s  ops/s  rops/s  wops/s
> >
> > Device:  rrqm/s  wrqm/s  r/s   w/s       rsec/s  wsec/s     avgrq-sz  avgqu-sz  await  svctm  %util
> > sda      0.00    0.33    0.00  1.33      0.00    10.67      8.00      0.00      0.00   0.00   0.00
> > sdb      0.00    6.00    0.00  10219.67  0.00    223561.67  21.88     4.08      0.40   0.09   89.43
> > sdc      0.00    6.00    0.00  9750.67   0.00    220286.67  22.59     2.47      0.25   0.09   89.83
> > dm-0     0.00    0.00    0.00  0.00      0.00    0.00       0.00      0.00      0.00   0.00   0.00
> > dm-1     0.00    0.00    0.00  1.33      0.00    10.67      8.00      0.00      0.00   0.00   0.00
> >
> > Filesystem:  rBlk_nor/s  wBlk_nor/s  rBlk_dir/s  wBlk_dir/s  rBlk_svr/s  wBlk_svr/s  ops/s  rops/s  wops/s
> >
> > Device:  rrqm/s  wrqm/s  r/s   w/s       rsec/s  wsec/s     avgrq-sz  avgqu-sz  await  svctm  %util
> > sda      0.00    0.00    0.00  1.00      0.00    26.67      26.67     0.00      0.00   0.00   0.00
> > sdb      0.00    6.33    0.00  10389.00  0.00    224668.67  21.63     3.78      0.36   0.09   89.23
> > sdc      0.00    4.33    0.00  10106.67  0.00    217986.00  21.57     3.83      0.38   0.09   91.10
> > dm-0     0.00    0.00    0.00  0.00      0.00    0.00       0.00      0.00      0.00   0.00   0.00
> > dm-1     0.00    0.00    0.00  1.00      0.00    26.67      26.67     0.00      0.00   0.00   0.00
> >
> > > > 2. For the scalability issue (10 hosts perform worse than 9
> > > > hosts), is there any tuning suggestion to improve it?
> > >
> > > Can you post exactly the test you are running and on how many
> > > hosts/volumes? That would help us debug.
> >
> > In the test, we run vdbench with the following parameters on one host:
> >
> > sd=sd1,lun=/dev/rbd2,threads=128
> > sd=sd2,lun=/dev/rbd0,threads=128
> > sd=sd3,lun=/dev/rbd1,threads=128
> > sd=sd4,lun=/dev/rbd3,threads=128
> > wd=wd1,sd=sd1,xfersize=4k,rdpct=0,openflags=o_direct
> > wd=wd2,sd=sd2,xfersize=4k,rdpct=0,openflags=o_direct
> > wd=wd3,sd=sd3,xfersize=4k,rdpct=0,openflags=o_direct
> > wd=wd4,sd=sd4,xfersize=4k,rdpct=0,openflags=o_direct
> > rd=run1,wd=wd*,iorate=100000,elapsed=500,interval=1
>
> Ok, I don't know a ton about vdbench. Is there any reason you are
> limiting the iorate to 100000? You might try running the test on
> multiple clients and seeing if that makes any difference. If you feel
> like it, it might be worth also running similar tests with something
> like fio just to verify that the same behaviour is present.
>
> > Thanks!
> > Mark

-- 
Christian Balzer        Network/Systems Engineer
chibi@xxxxxxx           Global OnLine Japan/Fusion Communications
http://www.gol.com/
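P.S.: If anybody wants to redo the backend-versus-client arithmetic from
the iostat sample quoted above, a quick sketch like the following does
it; the wsec/s values, the two SSDs per host, the 9 hosts and the 56k
IOPS of 4KB client writes are simply the figures from that quoted
9-host test, so treat it as an illustration rather than a tool:

  #!/usr/bin/env python
  # Turn the quoted iostat sample into aggregate backend write bandwidth
  # and compare it with the client-side bandwidth (sketch only).

  SECTOR = 512                             # iostat wsec/s is in 512-byte sectors

  # wsec/s for the two journal/data SSDs (sdb, sdc) in the first sample above
  wsec_per_host = [223561.67, 220286.67]
  hosts = 9

  backend_bytes = sum(wsec_per_host) * SECTOR * hosts   # all OSD devices
  client_bytes = 56000 * 4096                           # 56k IOPS of 4KB writes

  print("backend: %.0f MB/s" % (backend_bytes / 1e6))
  print("client:  %.0f MB/s" % (client_bytes / 1e6))
  print("ratio:   %.1fx" % (backend_bytes / client_bytes))

With those inputs it reports roughly 2GB/s of backend writes against
roughly 0.23GB/s from the client, i.e. the 8-9x ratio being discussed
in the thread.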