On 04/09/2013 04:53 AM, Mark Nelson wrote:
On 04/09/2013 01:48 AM, Matthieu Patou wrote:
On 04/08/2013 05:55 AM, Mark Nelson wrote:
On 04/08/2013 01:09 AM, Matthieu Patou wrote:
On 04/01/2013 11:26 PM, Matthieu Patou wrote:
On 04/01/2013 05:35 PM, Mark Nelson wrote:
On 03/31/2013 06:37 PM, Matthieu Patou wrote:
Hi,
I was doing some testing with iozone and found that the performance of an
exported rbd volume was about 1/3 of the performance of the hard drives.
I was expecting a performance penalty, but not such a large one.
I suspect something is not correct in the configuration, but I can't tell
what exactly.
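For reference, a minimal version of this setup looks something like the
following (image name, size and mount point are only examples, not
necessarily what was used here):

  # create a scratch image and map it with the kernel rbd client
  rbd create iozone-test --size 4096        # size in MB
  rbd map iozone-test                       # shows up as /dev/rbd0, udev also adds /dev/rbd/rbd/iozone-test
  # put a filesystem on it, mount it, and run iozone with direct I/O
  mkfs.ext4 /dev/rbd/rbd/iozone-test
  mount /dev/rbd/rbd/iozone-test /mnt/iozone-test
  cd /mnt/iozone-test && iozone -I -s 128m -r 64k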
A couple of things:
1) Were you testing writes? If so, were your journals on the same
disks as the OSDs? Ceph currently writes a full copy of the data to the
journal, which means you are doing 2 writes for every 1 client write.
This is why some people use fast SSDs holding 3-4 journals each. On the
other hand, that means that losing that SSD takes out several OSDs and
causes more data replication, and you may lose read performance and
capacity if the SSD is taking up a slot that an OSD would have otherwise
occupied.
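Keeping the journal off the data disk is just a per-OSD setting in
ceph.conf; for example (section name, host, device and size below are
only illustrative):

  [osd.0]
      host = ceph-01
      # journal on a partition of a shared SSD instead of the data disk
      osd journal = /dev/sdg1
      osd journal size = 1000        # in MB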
Oh right, no I don't. I'll try to do a quick test with a ramdrive
(this is just test data, I can afford to lose it).
I guess that one SSD can stand the load of 4 HDDs, maybe more, but
you're right that if you go too far (e.g. 20+ OSDs) the SSD might become
your new bottleneck.
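For the record, moving an existing OSD's journal for such a test goes
roughly like this (the ramdisk path is only an example, and losing a
journal normally means losing the OSD, so this is only for expendable
test data):

  # stop the OSD and write out whatever is still in the old journal
  service ceph stop osd.0
  ceph-osd -i 0 --flush-journal
  # point "osd journal" in ceph.conf at a file on the ramdisk, e.g.
  #   osd journal = /mnt/ramdisk/osd.0.journal
  ceph-osd -i 0 --mkjournal
  service ceph start osd.0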
Ok, I did my homework and moved the journal away. It improved the
performance, but I'm still at ~1/3 of the performance of iozone run
directly on the hard drive.
I also briefly added a ramdisk to ceph and benchmarked it: I got much
better performance (~180 MB/s), but it's still far from what the same
ramdisk does when benchmarked directly on one of the OSD hosts.
I'm still convinced that something is not going well, because if I run:
rados -p rbd bench 300 write -t 4
I can reach 90% of the performance of the benchmark. The client is
running a stock Ubuntu 12.10 kernel:
Linux builder01 3.5.0-23-generic #35-Ubuntu SMP Thu Jan 24 13:15:40 UTC
2013 x86_64 x86_64 x86_64 GNU/Linux
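One way to narrow down where the gap comes from is to take the
filesystem out of the picture and write to the mapped rbd device
directly with direct I/O (a sketch; the image name is the example from
above, and the write is destructive, so use a scratch image):

  # sequential 4 MB direct writes against the raw mapped device,
  # roughly comparable to rados bench with 4 MB objects
  dd if=/dev/zero of=/dev/rbd/rbd/iozone-test bs=4M count=256 oflag=direct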
2) Are you factoring in any replication being done?
What do you mean?
To my understanding, when osd0 receives the data it should send it
directly to osd1, or does it write it first and then read it back?
I.e. if you are doing write tests, is the pool you are writing to set to
use 2x replication (the default)? For performance benchmarking I
usually set it to 1x replication (i.e. no replication) unless I am
explicitly testing a replication scenario.
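For example, for the rbd pool used in the tests above, dropping to a
single replica and restoring it afterwards is just:

  ceph osd pool get rbd size       # check the current replication level
  ceph osd pool set rbd size 1     # 1x, i.e. no replication
  ceph osd pool set rbd size 2     # put the default back after the test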
Yes, I was using 2x replication and doing the performance benchmarking
with that scenario. Of course it means that, since the write has to be
acknowledged by both OSDs, you'll get no more than min(throughput of
osd#1, throughput of osd#2), and even a bit less, since some time is
lost sending the request to the first OSD, which in turn spends some
time sending the data to the second OSD.
But none of this should cause the system to have a throughput that is
1/3 of the nominal capacity of the disks.
Could you reply with the tests you ran and the numbers you got? Seeing
the raw data might help.
With 2x replication, every OSD is sending a copy of the data it
receives to some other OSD. That means that every OSD is on average
now doing 2X the writes, and has 2X the amount of incoming data over
the network. Assuming your journals are not on the same disks, your
write speed is going to be at best half of what the disks actually do.
On the client (iozone throughput in KB/s):

iozone -I -s 128m -r 64k
                                                  random  random    bkwd  record  stride
      KB  reclen   write rewrite    read  reread    read   write    read rewrite    read   fwrite frewrite    fread  freread
  131072      64   13414   12220   49396   49821   50181    7589   49657   17670   49227  3601088  4993022  9871395  8526676

iozone -I -s 128m -r 4m
                                                  random  random    bkwd  record  stride
      KB  reclen   write rewrite    read  reread    read   write    read rewrite    read   fwrite frewrite    fread  freread
  131072    4096   24836   19362  109361  109685  109576   24486   74901   27755   58324  2224014  3665214  8260073  8877222
On ceph-01:

iozone -I -s 128m -r 64k
                                                  random  random    bkwd  record  stride
      KB  reclen   write rewrite    read  reread    read   write    read rewrite    read   fwrite frewrite    fread  freread
  131072      64   78883   78913   52972   51753   10356   23332   32123  188720   10993  1156601  1524817  3031048  3076885

iozone -I -s 128m -r 4m
                                                  random  random    bkwd  record  stride
      KB  reclen   write rewrite    read  reread    read   write    read rewrite    read   fwrite frewrite    fread  freread
  131072    4096   98334   97621   96955   96947   83943   83614   83680   83191   87006   767305   905712  1778933  1790281
On ceph-02:

iozone -I -s 128m -r 64k
                                                  random  random    bkwd  record  stride
      KB  reclen   write rewrite    read  reread    read   write    read rewrite    read   fwrite frewrite    fread  freread
  131072      64   48184   49862   44058   43570    9393   19302   17795  138314    7828  1178037  1540321  3055577  3097672

iozone -I -s 128m -r 4m
                                                  random  random    bkwd  record  stride
      KB  reclen   write rewrite    read  reread    read   write    read rewrite    read   fwrite frewrite    fread  freread
  131072    4096   62941   62394   63080   61404   56513   55951   57730   54787   55684   769129   913794  1768660  1782636
For the purpose of the test the journal is stored in a 4G ramdisk; the
journal size is 1000 (1G).
Regarding the performance drop, I'm not sure I understand why 2x
replication should result in 1/2 the performance. Is ceph writing to the
first replica and then to the second, or is it trying to write to both
at almost the same time? (There must be some delay, because the data
arrives only on replica #1 and has to be read from the network and then
resent to replica #2.) In the former case the performance impact is
quite logical; in the latter case the numbers should be closer to the
native performance of the slowest replica.
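To put rough numbers on that (using the 4 MB-record iozone write figures
above, and assuming the two OSD data disks are the limiting factor):

  per-disk write speed (4 MB records):  ceph-01 ~96 MB/s, ceph-02 ~61 MB/s
  with 2x replication every write lands on both OSDs, so the ceiling is
  roughly min(96, 61) ~= 60 MB/s for the whole cluster
  rados bench, 4 MB writes:             ~44-46 MB/s  (~75% of that ceiling)
  iozone on the rbd volume, 4 MB recs:  ~24 MB/s     (~40% of that ceiling)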
I also still see rados bench being faster:
rados -p rbd bench 300 write -t 1
2013-04-09 22:55:46.351989 min lat: 0.105228 max lat: 0.175903 avg lat: 0.109805
sec Cur ops started finished avg MB/s cur MB/s last lat avg lat
300 1 2733 2732 36.4224 36 0.110325 0.109805
Total time run: 300.212541
Total writes made: 2734
Write size: 4194304
Bandwidth (MB/sec): 36.428
Stddev Bandwidth: 2.47917
Max bandwidth (MB/sec): 40
Min bandwidth (MB/sec): 0
Average Latency: 0.109803
Stddev Latency: 0.00182862
Max latency: 0.175903
Min latency: 0.105228
rados -p rbd bench 300 write -t 4
2013-04-09 23:18:01.629757 min lat: 0.10903 max lat: 2.39524 avg lat: 0.344683
sec Cur ops started finished avg MB/s cur MB/s last lat avg lat
300 4 3484 3480 46.3947 68 0.301892 0.344683
Total time run: 300.130446
Total writes made: 3484
Write size: 4194304
Bandwidth (MB/sec): 46.433
Stddev Bandwidth: 15.839
Max bandwidth (MB/sec): 80
Min bandwidth (MB/sec): 0
Average Latency: 0.344564
Stddev Latency: 0.220953
Max latency: 2.39524
Min latency: 0.10903
rados -p rbd bench 300 write
2013-04-09 23:03:11.630579 min lat: 0.115311 max lat: 6.05309 avg lat: 1.44797
sec Cur ops started finished avg MB/s cur MB/s last lat avg lat
300 16 3322 3306 44.0749 48 1.80913 1.44797
Total time run: 300.886165
Total writes made: 3323
Write size: 4194304
Bandwidth (MB/sec): 44.176
Stddev Bandwidth: 19.4799
Max bandwidth (MB/sec): 100
Min bandwidth (MB/sec): 0
Average Latency: 1.44864
Stddev Latency: 0.842172
Max latency: 6.05309
Min latency: 0.115311
Matthieu.
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com