Re: under performing osd, where to look ?


 



On 04/09/2013 04:53 AM, Mark Nelson wrote:
On 04/09/2013 01:48 AM, Matthieu Patou wrote:
On 04/08/2013 05:55 AM, Mark Nelson wrote:
On 04/08/2013 01:09 AM, Matthieu Patou wrote:
On 04/01/2013 11:26 PM, Matthieu Patou wrote:
On 04/01/2013 05:35 PM, Mark Nelson wrote:
On 03/31/2013 06:37 PM, Matthieu Patou wrote:
Hi,

I was doing some testing with iozone and found that the performance of an exported rbd volume was about 1/3 of the performance of the hard drives.
I was expecting a performance penalty, but not such a large one.
I suspect something is not correct in the configuration but I can't tell
what exactly.

A couple of things:

1) Were you testing writes?  If so, were your journals on the same
disks as the OSDs?  Ceph writes a full copy of the data to the
journal currently which means you are doing 2 writes for every 1.
This is why some people use fast SSDs for 3-4 journals each.  On the
other hand, that means that losing a journal causes more data
replication and you may lose read performance and capacity if the SSD
is taking up a spot that an OSD would have otherwise occupied.
Oh right, no I don't. I'll try to do a quick test with a ramdrive
(this is just test data, I can afford to lose it).
I guess that one SSD can stand the load of 4 HDDs, maybe more, but you're
right that if you go too far (ie. maybe 20+ OSDs) the SSD might become
your new bottleneck.
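The double-write effect Mark describes can be sketched as a toy model (a rough sketch; the disk bandwidth figure below is an assumed example, not a measurement from this thread):

```python
# Rough model of the journal double-write penalty: with the journal on
# the same spindle as the OSD data store, every client byte is written
# twice, so effective write bandwidth is roughly halved.
# The 100 MB/s disk figure is an assumed example value.

def effective_osd_write_bw(disk_bw_mb_s, journal_on_same_disk):
    """Best-case OSD write bandwidth, ignoring seeks and network cost."""
    return disk_bw_mb_s / 2 if journal_on_same_disk else disk_bw_mb_s

print(effective_osd_write_bw(100.0, journal_on_same_disk=True))   # 50.0
print(effective_osd_write_bw(100.0, journal_on_same_disk=False))  # 100.0
```

Moving the journal to a separate device (SSD or, for testing, a ramdisk) removes that factor of two, which is why it is the first thing to try.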
Ok I did my homework and moved the journal away; it improved the
performance, but I'm still at ~1/3 of the performance of iozone on the hard
drive.

I also briefly added a ramdisk to ceph and benchmarked it and got much
better perf (~180 MB/s), but it's still far from the perf of the same
ramdisk benchmarked on one of the OSD hosts.

I'm still convinced that something is not going well because if I run:
rados -p rbd bench 300 write -t 4

I can reach 90% of the performance of the benchmark, the client is
running a stock ubuntu 12.10 kernel:

Linux builder01 3.5.0-23-generic #35-Ubuntu SMP Thu Jan 24 13:15:40 UTC
2013 x86_64 x86_64 x86_64 GNU/Linux


2) Are you factoring in any replication being done?
What do you mean? To my understanding, when osd0 receives the data it
should send it directly to osd1; or will it write it first and then read
it back?

IE if you are doing write tests, is the pool you are writing to set to
use 2x replication (ie the default)?  For performance benchmarking I
usually set it to 1x replication (ie no replication) unless I am
explicitly testing a replication scenario.
Yes, I was using 2x replication, and I was deliberately benchmarking that
scenario. Of course it means that, since the write has to be acknowledged
by both OSDs, you'll not get more than min(throughput of osd#1, throughput
of osd#2), and even a bit less, as some time is lost sending the request
to the first OSD, which in turn takes some time to send the data on to the
second OSD.

But all of this shouldn't cause the system to have a throughput that is
1/3 of the nominal capacity of the disks.

Could you reply with the tests you ran and the numbers you got? Seeing the raw data might help.

With 2x replication, every OSD is sending a copy of the data it receives to some other OSD. That means that every OSD is on average doing 2x the writes, and has 2x the amount of incoming data over the network. At best, assuming your journals are not on the same disks, your write speed is going to be half of what the disks actually do.
On the client:

iozone  -I -s 128m  -r 64k

                                                      random   random     bkwd   record   stride
      KB  reclen    write  rewrite     read   reread     read    write     read  rewrite     read   fwrite  frewrite    fread  freread
  131072      64    13414    12220    49396    49821    50181     7589    49657    17670    49227  3601088   4993022  9871395  8526676

iozone  -I -s 128m  -r 4m
                                                      random   random     bkwd   record   stride
      KB  reclen    write  rewrite     read   reread     read    write     read  rewrite     read   fwrite  frewrite    fread  freread
  131072    4096    24836    19362   109361   109685   109576    24486    74901    27755    58324  2224014   3665214  8260073  8877222

On ceph-01:
iozone  -I -s 128m  -r 64k
                                                      random   random     bkwd   record   stride
      KB  reclen    write  rewrite     read   reread     read    write     read  rewrite     read   fwrite  frewrite    fread  freread
  131072      64    78883    78913    52972    51753    10356    23332    32123   188720    10993  1156601   1524817  3031048  3076885
iozone  -I -s 128m  -r 4m
                                                      random   random     bkwd   record   stride
      KB  reclen    write  rewrite     read   reread     read    write     read  rewrite     read   fwrite  frewrite    fread  freread
  131072    4096    98334    97621    96955    96947    83943    83614    83680    83191    87006   767305    905712  1778933  1790281


On ceph-02:
iozone  -I -s 128m  -r 64k
                                                      random   random     bkwd   record   stride
      KB  reclen    write  rewrite     read   reread     read    write     read  rewrite     read   fwrite  frewrite    fread  freread
  131072      64    48184    49862    44058    43570     9393    19302    17795   138314     7828  1178037   1540321  3055577  3097672
iozone  -I -s 128m  -r 4m
                                                      random   random     bkwd   record   stride
      KB  reclen    write  rewrite     read   reread     read    write     read  rewrite     read   fwrite  frewrite    fread  freread
  131072    4096    62941    62394    63080    61404    56513    55951    57730    54787    55684   769129    913794  1768660  1782636
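For what it's worth, the iozone write numbers for 4 MB records (~98 MB/s on ceph-01, ~63 MB/s on ceph-02) imply some rough upper bounds for 2x replicated writes. A back-of-the-envelope sketch, ignoring journal and network overhead:

```python
# Upper bounds implied by the 4 MB-record iozone write numbers above,
# under 2x replication. This ignores journal writes and network cost.

osd_write_bw = {"ceph-01": 98, "ceph-02": 63}  # MB/s, rounded from iozone

# A synchronous write is acked only once both replicas have committed,
# so a single write stream cannot exceed the slowest replica:
single_stream_ceiling = min(osd_write_bw.values())

# Aggregate cluster write bandwidth: every byte is stored twice, so the
# total disk bandwidth is divided by the replication factor:
aggregate_ceiling = sum(osd_write_bw.values()) / 2

print(single_stream_ceiling)  # 63
print(aggregate_ceiling)      # 80.5
```

The ~24 MB/s write figure seen from the client is well below either ceiling, which is consistent with suspecting a bottleneck other than replication itself.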

For the purpose of the test the journal is stored on a 4 GB ramdisk; the journal size is set to 1000 (1 GB).

Regarding the performance drop, I'm not sure I understand why 2x replication should result in 1/2 the performance. Is ceph writing to the first replica and then to the second, or is it trying to write to both at almost the same time (there must be some delay, because the data arrives only at replica #1 and has to be read from the network and then resent to replica #2)? In the former case the performance impact is quite logical; in the latter case the performance numbers should be closer to the native performance of the slowest replica.
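To make the two cases concrete, here is a toy latency model of one replicated write; the per-hop times are illustrative assumptions, not measurements from this cluster:

```python
# Toy latency model for one replicated write, comparing a strictly
# serial path (primary commits, then forwards to the replica) with a
# pipelined path (primary forwards while committing locally).
# Per-hop times below are illustrative assumptions only.

net_ms = 40   # assumed: time to push one 4 MB object over one network hop
disk_ms = 40  # assumed: time to commit one 4 MB object on one OSD

# Serial: client->osd1 transfer, osd1 commit, osd1->osd2 transfer, osd2 commit
serial_ms = net_ms + disk_ms + net_ms + disk_ms

# Pipelined: osd1 forwards while writing, so its local commit overlaps
# the second hop; latency is the longer branch after the first hop
pipelined_ms = net_ms + max(disk_ms, net_ms + disk_ms)

print(serial_ms, pipelined_ms)  # 160 120
```

Under these assumptions the pipelined path is noticeably faster than the serial one, but still slower than a single unreplicated write (net_ms + disk_ms = 80 ms), so some penalty is expected either way.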

I also still see rados bench being faster:

rados -p rbd bench 300 write -t 1
2013-04-09 22:55:46.351989 min lat: 0.105228 max lat: 0.175903 avg lat: 0.109805
   sec Cur ops   started  finished  avg MB/s  cur MB/s  last lat avg lat
   300       1      2733      2732   36.4224        36  0.110325 0.109805
 Total time run:         300.212541
Total writes made:      2734
Write size:             4194304
Bandwidth (MB/sec):     36.428

Stddev Bandwidth:       2.47917
Max bandwidth (MB/sec): 40
Min bandwidth (MB/sec): 0
Average Latency:        0.109803
Stddev Latency:         0.00182862
Max latency:            0.175903
Min latency:            0.105228

rados -p rbd bench 300 write -t 4

2013-04-09 23:18:01.629757 min lat: 0.10903 max lat: 2.39524 avg lat: 0.344683
   sec Cur ops   started  finished  avg MB/s  cur MB/s  last lat avg lat
   300       4      3484      3480   46.3947        68  0.301892 0.344683
 Total time run:         300.130446
Total writes made:      3484
Write size:             4194304
Bandwidth (MB/sec):     46.433

Stddev Bandwidth:       15.839
Max bandwidth (MB/sec): 80
Min bandwidth (MB/sec): 0
Average Latency:        0.344564
Stddev Latency:         0.220953
Max latency:            2.39524
Min latency:            0.10903


rados -p rbd bench 300 write
2013-04-09 23:03:11.630579 min lat: 0.115311 max lat: 6.05309 avg lat: 1.44797
   sec Cur ops   started  finished  avg MB/s  cur MB/s  last lat avg lat
   300      16      3322      3306   44.0749        48   1.80913 1.44797
 Total time run:         300.886165
Total writes made:      3323
Write size:             4194304
Bandwidth (MB/sec):     44.176

Stddev Bandwidth:       19.4799
Max bandwidth (MB/sec): 100
Min bandwidth (MB/sec): 0
Average Latency:        1.44864
Stddev Latency:         0.842172
Max latency:            6.05309
Min latency:            0.115311

Matthieu.
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



