On 04/06/2013 12:11 AM, Kelvin_Huang@xxxxxxxxxx wrote:
Hi all,
Hi Kelvin!
I have some questions after my RBD performance test.
Setup:
Linux kernel: 3.6.11
OS: Ubuntu 12.04
RAID card: LSI MegaRAID SAS 9260-4i
For every HDD: RAID0, Write Policy: Write Back with BBU, Read Policy: ReadAhead, IO Policy: Direct
Storage server number : 1
Storage server :
8 * HDD (each storage server has 8 osd, 7200 rpm, 2T)
4 * SSD (every 2 OSDs share 1 SSD for journals; each SSD is divided into two partitions, sdx1 and sdx2)
Ceph version : 0.56.4
Replicas : 2
Monitor number: 1
The write speed of HDD:
# dd if=/dev/zero of=/dev/sdd bs=1024k count=10000 oflag=direct
10000+0 records in
10000+0 records out
10485760000 bytes (10 GB) copied, 69.3961 s, 151 MB/s
The write speed of SSD:
# dd if=/dev/zero of=/dev/sdb bs=1024k count=10000 oflag=direct
10000+0 records in
10000+0 records out
10485760000 bytes (10 GB) copied, 40.8671 s, 257 MB/s
One thing I would suggest doing is seeing how well the spinning disks and
SSDs do if you write to all of them concurrently. It's still not entirely
accurate, though, since Ceph writes to many individual files stored in a
giant directory hierarchy, which requires dentry lookups, and it's also
doing xattrs and other work. Still, this may give you an idea of whether
there is some bottleneck for sequential writes (say an oversubscribed
expander backplane running at 3.0Gb/s due to SATA disks, or something like
that).
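Something like the following is what I have in mind, assuming the device names from your collectl filters (sdc-sdj for the data disks, sdb/sdk/sdl/sdm for the SSDs). It's just as destructive as your single-disk dd runs, so only do it on devices you're willing to scribble over:

# Kick off one direct-IO writer per device in parallel, then wait for all of them.
for dev in sdc sdd sde sdf sdg sdh sdi sdj sdb sdk sdl sdm; do
    dd if=/dev/zero of=/dev/$dev bs=1024k count=10000 oflag=direct &
done
wait

Watching that run with the same collectl invocations you used will show whether the per-disk rates drop once everything is busy at the same time.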
Then we used the RADOS benchmark and collectl to observe write performance:
#rados -p rbd bench 300 write -t 256
2013-04-05 14:31:13.732737 min lat: 4.28207 max lat: 5.92085 avg lat: 4.78598
  sec Cur ops   started  finished  avg MB/s  cur MB/s  last lat   avg lat
  300     256     16043     15787   210.455       196      5.91   4.78598
Total time run: 300.588962
Total writes made: 16043
Write size: 4194304
Bandwidth (MB/sec): 213.488
Stddev Bandwidth: 40.6795
Max bandwidth (MB/sec): 288
Min bandwidth (MB/sec): 0
Average Latency: 4.75647
Stddev Latency: 0.37182
Max latency: 5.93183
Min latency: 0.590936
collectl on OSDs :
#collectl --iosize -sCDN --dskfilt "sd(c|d|e|f|g|h|i|j)"
# DISK STATISTICS (/sec)
#       <---------reads---------><---------writes---------><--------averages--------> Pct
#Name   KBytes Merged  IOs Size  KBytes Merged  IOs Size  RWSize  QLen  Wait SvcTim  Util
sdc          0      0    0    0   76848    563  460  167     167    12    26      0    42
sdd          0      0    0    0   45100      0  165  273     273     6    36      1    30
sde          0      0    0    0   73800      0  270  273     273     3    14      1    41
sdf          0      0    0    0   73800      0  270  273     273    17    64      1    33
sdg          0      0    0    0   41000      0  150  273     273     1     7      0    10
sdh          0      0    0    0   57400      0  210  273     273     4    20      1    27
sdi          0      0    0    0   36904      0  136  271     271     0     5      0     7
sdj          0      0    0    0   77776      0  285  273     272    28    87      1    48
collectl on SSDs :
#collectl --iosize -sCDN --dskfilt "sd(b|k|l|m)"
# DISK STATISTICS (/sec)
#       <---------reads---------><---------writes---------><--------averages--------> Pct
#Name   KBytes Merged  IOs Size  KBytes Merged  IOs Size  RWSize  QLen  Wait SvcTim  Util
sdb          0      0    0    0  115552      0  388  298     297    75   159      2    77
sdk          0      0    0    0  114592      0  389  295     294    12    33      0    38
sdl          0      0    0    0  100364      0  334  300     300    35   148      2    69
sdm          0      0    0    0  101644      0  345  295     294   245   583      2    99  <= almost 99%
Thanks for providing so much information, it definitely helps!
My questions are:
1. The rados benchmark write is a random write, right?
rados bench just writes out objects, so it's random from the perspective
that each OSD is pseudo-randomly assigned a 4MB chunk of data to write
out (as a file). It's not explicitly random though in that there is no
inherent requirement that the objects be written out in any particular
place relative to each other (like if you were doing explicit random
writes in a huge file). How random the writes end up being will depend
on how fragmented the disk is and where data ends up getting placed by
the underlying OSD file system.
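If you want to see that mapping for yourself, you can write a throwaway object and ask where CRUSH placed it (the object name below is just an example; 'rbd' is the pool you benchmarked against):

# Put a dummy 4MB object into the pool, then ask which PG and OSDs it maps to.
dd if=/dev/zero of=/tmp/testobj bs=4M count=1
rados -p rbd put testobj /tmp/testobj
ceph osd map rbd testobj
rados -p rbd rm testobj

The osd map output shows the PG the object hashes to and the OSDs in its up/acting set; different object names land on different PGs (and hence different OSD sets), which is where the pseudo-randomness comes from.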
2. Why does the write bandwidth hit a bottleneck at 213MB/s even when the
concurrency is increased (-t 512)?
One quick test is to make a new pool with something like 2048 PGs and
replication 1, and see how rados bench performs against it. Also, what
filesystem are you using under the OSDs, and with what mkfs/mount options?
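A rough sketch of that test (the pool name 'bench1' is just an example):

# Throwaway pool with more PGs and a single replica, then re-run the benchmark.
ceph osd pool create bench1 2048 2048
ceph osd pool set bench1 size 1
rados -p bench1 bench 300 write -t 256

# The OSD filesystems and mount options should show up with (assuming the
# default /var/lib/ceph/osd mount points):
mount | grep /var/lib/ceph/osd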
I don't have tests with 8 drives and 4 SSDs, but I do have some tests
with 6 drives and 2 very fast (~400MB/s) SSDs on a SAS2208 based
controller (similar to an LSI SAS9265):
http://ceph.com/uncategorized/argonaut-vs-bobtail-performance-preview/
Look at the 4MB results with 256 concurrent writes and 6 disks with 2
SSDs. Those results are with 1x replication. Both the XFS and EXT4
numbers are roughly in line with what you saw (~350-400MB/s at 1x
replication vs your 213.5MB/s at 2x replication). BTRFS does
significantly better.
It looks a bit worse, because collectl shows the SSDs' write throughput
is only 100~120MB/s, but the SSDs should be able to do 250MB/s.
It's probably not the SSDs. See below.
3. Why is one SSD (sdm) at almost 99% [Util]? Does that mean the data
written to the OSDs is not evenly distributed?
If the SSDs are just being used for journals, the writes will be very
simple. The OSDs just append data to the journal at whatever spot the
last write left off. It's all sequential and there's almost no seek
behaviour at all.
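If you want to sanity-check a journal SSD on its own, a direct+dsync dd against the journal partition is a rough stand-in for that append-only pattern. It overwrites the journal, so only do it with the OSDs that use that SSD stopped (and /dev/sdm1 below is just a guess at one of your journal partitions):

# Sequential, synchronous direct writes, roughly like the journal's append pattern.
# WARNING: destroys the journal contents; stop the OSDs using this SSD first.
dd if=/dev/zero of=/dev/sdm1 bs=1M count=2000 oflag=direct,dsync

If you do overwrite a live journal partition, recreate the journal (ceph-osd -i <n> --mkjournal) before bringing the OSD back up.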
What I did find interesting, though, is that queue wait times on SSD
journals during 4MB writes were higher on RAID-based controllers (say,
the LSI SAS2208) than on simple SAS controllers (LSI SAS2308):
http://ceph.com/wp-content/uploads/2012/10/16Concurrent4MWriteJournalWait.png
I suspect this had less to do with the SSDs themselves and more to do
with the controller itself possibly becoming a bottleneck under the kind
of workload that Ceph generates across all of the disks and SSDs. Simple
SAS controllers have no cache and don't have to do as much processing on
the card. With fast SSD journals it's entirely possible that a "smart"
RAID controller may actually be worse than a simple/cheap SAS controller.
4. If the SSDs are not the bottleneck for write performance, what is?
5. How can I improve write performance?
1) Upgrading to 0.60 might help. We made a number of changes that have
improved performance, though you will likely see the greatest increase
with smaller IOs.
2) If you are using XFS, you may see a performance increase by switching
to EXT4 or BTRFS. BTRFS especially tends to do better on setups similar
to yours, but it may degrade with use and end up slower over time; I'm
not sure if that is still the case on newer kernels. With more OSDs
(~24) per node, BTRFS loses its edge and EXT4/XFS start looking better
even on fresh filesystems. (Some starting-point mkfs/mount options are
sketched after this list.)
3) I've had my fastest results on systems that don't use SAS expanders
and instead use multiple controllers. I've seen some chassis with
expanders go very fast, but it's not consistent and seems to depend on
the combination of controller, expander, and drives. If you can try
putting in a 2nd controller and connecting your disks and SSDs directly
to it, I would be very curious to hear whether your results change.
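For 2), these are roughly the mkfs/mount options I'd start from. Treat the device and mount point as placeholders, and this as a starting sketch rather than a tuned recommendation:

# XFS: larger inodes leave more room for Ceph's xattrs.
mkfs.xfs -f -i size=2048 /dev/sdX
mount -o noatime,inode64 /dev/sdX /var/lib/ceph/osd/ceph-N

# EXT4: needs user_xattr, and ext4's small xattr limit means you also want
#   filestore xattr use omap = true
# in the [osd] section of ceph.conf.
mkfs.ext4 /dev/sdX
mount -o noatime,user_xattr /dev/sdX /var/lib/ceph/osd/ceph-N

# BTRFS: the defaults are fine to start with.
mkfs.btrfs /dev/sdX
mount -o noatime /dev/sdX /var/lib/ceph/osd/ceph-N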
Hope this helps!
Thanks!!
- Kelvin
Mark
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com