On 04/06/2013 12:11 AM, Kelvin_Huang@xxxxxxxxxx wrote:
Hi all,
Hi Kelvin!
I have some questions after my RBD performance test.
Setup:
Linux kernel: 3.6.11
OS: Ubuntu 12.04
RAID card: LSI MegaRAID SAS 9260-4i
For every HDD: RAID0, Write Policy: Write Back with BBU, Read Policy: ReadAhead, IO Policy: Direct
Storage server number : 1
Storage server :
8 * HDD (each storage server has 8 osd, 7200 rpm, 2T)
4 * SSD (every 2 OSDs share 1 SSD for journals; each SSD is divided into two partitions, sdx1 and sdx2)
Ceph version : 0.56.4
Replicas : 2
Monitor number: 1
The write speed of HDD:
# dd if=/dev/zero of=/dev/sdd bs=1024k count=10000 oflag=direct
10000+0 records in
10000+0 records out
10485760000 bytes (10 GB) copied, 69.3961 s, 151 MB/s
The write speed of SSD:
# dd if=/dev/zero of=/dev/sdb bs=1024k count=10000 oflag=direct
10000+0 records in
10000+0 records out
10485760000 bytes (10 GB) copied, 40.8671 s, 257 MB/s
One thing I would suggest doing is seeing how well the spinning disks and
SSDs do if you write to all of them concurrently. It's still not entirely
accurate, though, since Ceph writes to many individual files stored in a
giant directory hierarchy, which requires dentry lookups, and it's also
doing xattrs and other work. Still, this may give you an idea of whether
there is some bottleneck for sequential writes (say an oversubscribed
expander backplane running at 3.0Gb/s due to SATA disks, or something like
that).
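Something like the following is what I have in mind, assuming the device names from your collectl filters (sdc-sdj for the data disks, sdb/sdk/sdl/sdm for the SSDs). It's just as destructive as your single-disk dd runs, so only do it on devices you're willing to scribble over:

# Kick off one direct-IO writer per device in parallel, then wait for all of them.
for dev in sdc sdd sde sdf sdg sdh sdi sdj sdb sdk sdl sdm; do
    dd if=/dev/zero of=/dev/$dev bs=1024k count=10000 oflag=direct &
done
wait

Watching that run with the same collectl invocations you used will show whether the per-disk rates drop once everything is busy at the same time.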
Then we used the RADOS benchmark and collectl to observe write performance:
#rados -p rbd bench 300 write -t 256
2013-04-05 14:31:13.732737 min lat: 4.28207 max lat: 5.92085 avg lat: 4.78598
  sec Cur ops   started  finished  avg MB/s  cur MB/s  last lat   avg lat
  300     256     16043     15787   210.455       196      5.91   4.78598
Total time run: 300.588962
Total writes made: 16043
Write size: 4194304
Bandwidth (MB/sec): 213.488
Stddev Bandwidth: 40.6795
Max bandwidth (MB/sec): 288
Min bandwidth (MB/sec): 0
Average Latency: 4.75647
Stddev Latency: 0.37182
Max latency: 5.93183
Min latency: 0.590936
collectl on OSDs :
#collectl --iosize -sCDN --dskfilt "sd(c|d|e|f|g|h|i|j)"
# DISK STATISTICS (/sec)
#       <---------reads---------><---------writes---------><--------averages--------> Pct
#Name   KBytes Merged  IOs Size  KBytes Merged  IOs Size  RWSize  QLen  Wait SvcTim  Util
sdc          0      0    0    0   76848    563  460  167     167    12    26      0    42
sdd          0      0    0    0   45100      0  165  273     273     6    36      1    30
sde          0      0    0    0   73800      0  270  273     273     3    14      1    41
sdf          0      0    0    0   73800      0  270  273     273    17    64      1    33
sdg          0      0    0    0   41000      0  150  273     273     1     7      0    10
sdh          0      0    0    0   57400      0  210  273     273     4    20      1    27
sdi          0      0    0    0   36904      0  136  271     271     0     5      0     7
sdj          0      0    0    0   77776      0  285  273     272    28    87      1    48
collectl on SSDs :
#collectl --iosize -sCDN --dskfilt "sd(b|k|l|m)"
# DISK STATISTICS (/sec)
#       <---------reads---------><---------writes---------><--------averages--------> Pct
#Name   KBytes Merged  IOs Size  KBytes Merged  IOs Size  RWSize  QLen  Wait SvcTim  Util
sdb          0      0    0    0  115552      0  388  298     297    75   159      2    77
sdk          0      0    0    0  114592      0  389  295     294    12    33      0    38
sdl          0      0    0    0  100364      0  334  300     300    35   148      2    69
sdm          0      0    0    0  101644      0  345  295     294   245   583      2    99  <= almost 99%
Thanks for providing so much information, it definitely helps!
My questions are:
1. The rados benchmark write is a random write, right?
rados bench just writes out objects, so it's random from the perspective
that each OSD is pseudo-randomly assigned a 4MB chunk of data to write
out (as a file). It's not explicitly random though in that there is no
inherent requirement that the objects be written out in any particular
place relative to each other (like if you were doing explicit random
writes in a huge file). How random the writes end up being will depend
on how fragmented the disk is and where data ends up getting placed by
the underlying OSD file system.
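If you want to see that mapping for yourself, you can write a throwaway object and ask where CRUSH placed it (the object name below is just an example; 'rbd' is the pool you benchmarked against):

# Put a dummy 4MB object into the pool, then ask which PG and OSDs it maps to.
dd if=/dev/zero of=/tmp/testobj bs=4M count=1
rados -p rbd put testobj /tmp/testobj
ceph osd map rbd testobj
rados -p rbd rm testobj

The osd map output shows the PG the object hashes to and the OSDs in its up/acting set; different object names land on different PGs (and hence different OSD sets), which is where the pseudo-randomness comes from.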
2. Why does the write bandwidth hit a bottleneck at 213MB/s even when the
concurrency is increased (-t 512)?
One quick test is to make a new pool with something like 2048 PGs and
replication 1, and see how rados bench performs against it. Also, what
filesystem are you using under the OSDs, and with what mkfs/mount options?
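A rough sketch of that test (the pool name 'bench1' is just an example):

# Throwaway pool with more PGs and a single replica, then re-run the benchmark.
ceph osd pool create bench1 2048 2048
ceph osd pool set bench1 size 1
rados -p bench1 bench 300 write -t 256

# The OSD filesystems and mount options should show up with (assuming the
# default /var/lib/ceph/osd mount points):
mount | grep /var/lib/ceph/osd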
I don't have tests with 8 drives and 4 SSDs, but I do have some tests
with 6 drives and 2 very fast (~400MB/s) SSDs on a SAS2208 based
controller (similar to an LSI SAS9265):
http://ceph.com/uncategorized/argonaut-vs-bobtail-performance-preview/
Look at the 4MB results with 256 concurrent writes and 6 disks with 2
SSDs. Those results are with 1x replication. Both the XFS and EXT4
numbers are roughly in line with what you saw (~350-400MB/s at 1x
replication vs your 213.5MB/s at 2x replication). BTRFS does
significantly better.
It looks a bit worse, because collectl shows the SSDs' write throughput
is only 100~120MB/s, but the SSDs should be able to do 250MB/s.
It's probably not the SSDs. See below.
3. Why is one SSD (sdm) at almost 99% [Util]? Does that mean the data
written to the OSDs is not evenly distributed?
If the SSDs are just being used for journals, the writes will be very
simple. The OSDs just append data to the journal at whatever spot the
last write left off. It's all sequential and there's almost no seek
behaviour at all.
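If you want to sanity-check a journal SSD on its own, a direct+dsync dd against the journal partition is a rough stand-in for that append-only pattern. It overwrites the journal, so only do it with the OSDs that use that SSD stopped (and /dev/sdm1 below is just a guess at one of your journal partitions):

# Sequential, synchronous direct writes, roughly like the journal's append pattern.
# WARNING: destroys the journal contents; stop the OSDs using this SSD first.
dd if=/dev/zero of=/dev/sdm1 bs=1M count=2000 oflag=direct,dsync

If you do overwrite a live journal partition, recreate the journal (ceph-osd -i <n> --mkjournal) before bringing the OSD back up.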
What I did find interesting, though, is that queue wait times on SSD
journals during 4MB writes were higher on RAID-based controllers (say,
the LSI SAS2208) than on simple SAS controllers (LSI SAS2308):
http://ceph.com/wp-content/uploads/2012/10/16Concurrent4MWriteJournalWait.png
I suspect this had less to do with the SSDs themselves and more to do
with the controller itself possibly becoming a bottleneck under the kind
of workload that Ceph generates across all of the disks and SSDs. Simple
SAS controllers have no cache and don't have to do as much processing on
the card. With fast SSD journals it's entirely possible that a "smart"
RAID controller may actually be worse than a simple/cheap SAS controller.
4. If the SSDs are not the bottleneck for write performance, what is?
5. How can I improve write performance?
1) Upgrading to 0.60 might help. We made a number of changes that have
improved performance, though you will likely see the greatest increase
with smaller IOs.
2) If you are using XFS, you may see a performance increase by switching
to EXT4 or BTRFS. BTRFS especially tends to do better on setups similar
to yours, but it may degrade with use and end up slower over time; I'm
not sure if that is still the case on newer kernels. With more OSDs
(~24) per node, BTRFS loses its edge and EXT4/XFS start looking better
even on fresh filesystems. (Some starting-point mkfs/mount options are
sketched after this list.)
3) I've had my fastest results on systems that don't use SAS expanders
and instead use multiple controllers. I've seen some chassis with
expanders go very fast, but it's not consistent and seems to depend on
the combination of controller, expander, and drives. If you can try
putting in a 2nd controller and connecting your disks and SSDs directly
to it, I would be very curious to hear whether your results change.
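For 2), these are roughly the mkfs/mount options I'd start from. Treat the device and mount point as placeholders, and this as a starting sketch rather than a tuned recommendation:

# XFS: larger inodes leave more room for Ceph's xattrs.
mkfs.xfs -f -i size=2048 /dev/sdX
mount -o noatime,inode64 /dev/sdX /var/lib/ceph/osd/ceph-N

# EXT4: needs user_xattr, and ext4's small xattr limit means you also want
#   filestore xattr use omap = true
# in the [osd] section of ceph.conf.
mkfs.ext4 /dev/sdX
mount -o noatime,user_xattr /dev/sdX /var/lib/ceph/osd/ceph-N

# BTRFS: the defaults are fine to start with.
mkfs.btrfs /dev/sdX
mount -o noatime /dev/sdX /var/lib/ceph/osd/ceph-N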
Hope this helps!
Thanks!!
- Kelvin
Mark
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com