Re: RBD performance test (write) problem

On 04/06/2013 12:11 AM, Kelvin_Huang@xxxxxxxxxx wrote:
Hi all,

Hi Kelvin!


I ran into some problems with my RBD performance test.

Setup:

Linux kernel: 3.6.11

OS: Ubuntu 12.04

RAID card: LSI MegaRAID SAS 9260-4i. For every HDD: RAID0; Write Policy:
Write Back with BBU, Read Policy: ReadAhead, IO Policy: Direct

Storage server number : 1

Storage server :

8 * HDD (each storage server has 8 OSDs; 7200 rpm, 2 TB)

4 * SSD (every 2 OSDs share 1 SSD as journal; each SSD is divided into two
partitions, sdx1 and sdx2)

Ceph version : 0.56.4

Replicas : 2

Monitor number: 1

The write speed of HDD:

# dd if=/dev/zero of=/dev/sdd bs=1024k count=10000 oflag=direct

10000+0 records in

10000+0 records out

10485760000 bytes (10 GB) copied, 69.3961 s, 151 MB/s

The write speed of SSD:

# dd if=/dev/zero of=/dev/sdb bs=1024k count=10000 oflag=direct

10000+0 records in

10000+0 records out

10485760000 bytes (10 GB) copied, 40.8671 s, 257 MB/s


One thing I would suggest is seeing how well the spinning disks and SSDs do if you write to all of them concurrently. It's still not entirely representative, since Ceph writes many individual files stored in a giant directory hierarchy (which requires dentry lookups) and is also doing xattrs and other work, but it may give you an idea of whether there is some bottleneck for sequential writes (say, an oversubscribed expander backplane running at 3.0 Gb/s due to SATA disks, or something like that).
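For example (a rough sketch only; this is destructive to anything on the target disks, so only run it on drives that don't yet hold OSD data, and the sdc-sdj names below just follow your HDD layout above):

# for d in sdc sdd sde sdf sdg sdh sdi sdj; do
>   dd if=/dev/zero of=/dev/$d bs=1024k count=10000 oflag=direct &
> done; wait

Run the equivalent for the SSDs. If the aggregate throughput comes in well below the sum of the single-drive numbers, the controller or backplane is probably the limit.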

Then we used the RADOS benchmark and collectl to observe write performance:

#rados -p rbd bench 300 write -t 256

2013-04-05 14:31:13.732737 min lat: 4.28207 max lat: 5.92085 avg lat: 4.78598

    sec Cur ops   started  finished  avg MB/s  cur MB/s  last lat   avg lat

    300     256     16043     15787   210.455       196      5.91   4.78598

Total time run:         300.588962

Total writes made:      16043

Write size:             4194304

Bandwidth (MB/sec):     213.488

Stddev Bandwidth:       40.6795

Max bandwidth (MB/sec): 288

Min bandwidth (MB/sec): 0

Average Latency:        4.75647

Stddev Latency:         0.37182

Max latency:            5.93183

Min latency:            0.590936

collectl on OSDs :

#collectl  --iosize -sCDN --dskfilt "sd(c|d|e|f|g|h|i|j)"

# DISK STATISTICS (/sec)
#          <---------reads---------><---------writes---------><--------averages--------> Pct
#Name       KBytes Merged  IOs Size  KBytes Merged  IOs Size  RWSize  QLen  Wait SvcTim Util
sdc              0      0    0    0   76848    563  460  167     167    12    26      0   42
sdd              0      0    0    0   45100      0  165  273     273     6    36      1   30
sde              0      0    0    0   73800      0  270  273     273     3    14      1   41
sdf              0      0    0    0   73800      0  270  273     273    17    64      1   33
sdg              0      0    0    0   41000      0  150  273     273     1     7      0   10
sdh              0      0    0    0   57400      0  210  273     273     4    20      1   27
sdi              0      0    0    0   36904      0  136  271     271     0     5      0    7
sdj              0      0    0    0   77776      0  285  273     272    28    87      1   48

collectl on SSDs :

#collectl  --iosize -sCDN --dskfilt "sd(b|k|l|m)"

# DISK STATISTICS (/sec)
#          <---------reads---------><---------writes---------><--------averages--------> Pct
#Name       KBytes Merged  IOs Size  KBytes Merged  IOs Size  RWSize  QLen  Wait SvcTim Util
sdb              0      0    0    0  115552      0  388  298     297    75   159      2   77
sdk              0      0    0    0  114592      0  389  295     294    12    33      0   38
sdl              0      0    0    0  100364      0  334  300     300    35   148      2   69
sdm              0      0    0    0  101644      0  345  295     294   245   583      2   99  <= almost 99%


Thanks for providing so much information, it definitely helps!

My question is:

1. The rados benchmark write is a random write, right?

rados bench just writes out whole objects, so it is random in the sense that each OSD is pseudo-randomly assigned 4MB chunks of data to write out (as files). It is not explicitly random, though: there is no inherent requirement that the objects be written out in any particular place relative to each other (as there would be if you were doing explicit random writes within one huge file). How random the writes actually end up being depends on how fragmented the disk is and where the underlying OSD filesystem places the data.
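As a quick, purely illustrative check (assuming the OSD data disks are on XFS; the partition name here is just an example), xfs_db can report a fragmentation factor for an OSD data partition:

# xfs_db -r -c frag /dev/sdc1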


2. Why does the write bandwidth hit a bottleneck at about 213 MB/s even if I
increase the concurrency (-t 512)?

One quick test is to make a new pool with something like 2048 PGs and replication 1, and see how rados bench performs against it. Also, what filesystem are you using under the OSDs, and what mkfs/mount options?
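Something roughly like this (the pool name is just a placeholder):

# ceph osd pool create benchtest 2048 2048
# ceph osd pool set benchtest size 1
# rados -p benchtest bench 300 write -t 256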

I don't have tests with 8 drives and 4 SSDs, but I do have some tests with 6 drives and 2 very fast (~400MB/s) SSDs on a SAS2208 based controller (similar to a LSI SAS9265):

http://ceph.com/uncategorized/argonaut-vs-bobtail-performance-preview/

Look at the 4MB results with 256 concurrent writes and 6 disks with 2 SSDs. Those results are with 1x replication. Both the XFS and EXT4 numbers (~350-400 MB/s at 1x replication) seem roughly in line with the 213.5 MB/s you saw at 2x replication. BTRFS does significantly better.


   It looks a bit worse, because collectl shows the SSDs' write
throughput is only about 100-120 MB/s, but each SSD should be capable of 250 MB/s.

It's probably not the SSDs.  See below.


3. Why is one SSD (sdm) at almost 99% [Util]? Does that mean the data written
to the OSDs is not evenly distributed?

If the SSDs are just being used for journals, the writes will be very simple: the OSDs just append data to the journal at whatever spot the last write left off. It's all sequential, and there's almost no seek behaviour at all.
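If you want to sanity-check the raw sequential throughput of one of the journal SSDs, a rough approximation is a direct, synchronous sequential write to the journal partition (sdm1 here is just an example name). Note this overwrites the journal, so only do it with the OSDs using that SSD stopped, and re-create the journal (ceph-osd -i N --mkjournal) before restarting them:

# dd if=/dev/zero of=/dev/sdm1 bs=4M count=2500 oflag=direct,dsync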

What I did find interesting, though, is that queue wait times on SSD journals during 4MB writes were higher on RAID-based controllers (say, the LSI SAS2208) than on simple SAS controllers (LSI SAS2308):

http://ceph.com/wp-content/uploads/2012/10/16Concurrent4MWriteJournalWait.png

I suspect this had less to do with the SSDs themselves and more to do with the controller possibly becoming a bottleneck under the kind of workload that Ceph generates across all of the disks and SSDs. Simple SAS controllers have no cache and don't have to do as much processing on the card. With fast SSD journals it's entirely possible that a "smart" RAID controller may actually be worse than a simple/cheap SAS controller.


4. If the SSDs are not the write-performance bottleneck, what is?

5. How can I improve write performance?

1) Upgrading to 0.60 might help; we've made a number of changes that improve performance, though you will likely see the greatest gains with smaller IOs.

2) If you are using XFS, you may see a performance increase by switching to EXT4 or BTRFS (an example of setting the OSD filesystem options in ceph.conf is sketched after this list). BTRFS especially tends to do better on setups similar to yours, but may degrade with use and end up slower over time; I'm not sure whether that is still the case with newer kernels. With more OSDs (~24) per node, BTRFS loses its edge and EXT4/XFS start looking better even on fresh filesystems.

3) I've had my fastest results on systems that don't use SAS expanders and instead use multiple controllers. I've seen some chassis with expanders go very fast, but it's not consistent and seems to depend on the combination of controller, expander, and drives. If you can try putting in a 2nd controller and directly connecting your disks and SSDs to it, I would be very curious to hear whether your results change.
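For item 2, the sort of ceph.conf snippet I mean looks roughly like this (the values are illustrative assumptions, not tuned recommendations, and they only affect OSDs created after the change):

[osd]
    osd mkfs type = btrfs
    osd mkfs options btrfs = -l 16k -n 16k
    osd mount options btrfs = rw,noatime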

Hope this helps!


Thanks!!

- Kelvin


Mark


_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

