On 04/09/2013 04:53 AM, Mark Nelson wrote:
On 04/09/2013 01:48 AM, Matthieu Patou wrote:
On 04/08/2013 05:55 AM, Mark Nelson wrote:
On 04/08/2013 01:09 AM, Matthieu Patou wrote:
On 04/01/2013 11:26 PM, Matthieu Patou wrote:
On 04/01/2013 05:35 PM, Mark Nelson wrote:
On 03/31/2013 06:37 PM, Matthieu Patou wrote:
Hi,
I was doing some testing with iozone and found that the performance of an
exported rbd volume was about 1/3 of the performance of the hard drives.
I was expecting a performance penalty, but not such a large one.
I suspect something is not correct in the configuration, but I can't tell
what exactly.
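For reference, a minimal version of this setup looks something like the
following (image name, size and mount point are only examples, not
necessarily what was used here):

  # create a scratch image and map it with the kernel rbd client
  rbd create iozone-test --size 4096        # size in MB
  rbd map iozone-test                       # shows up as /dev/rbd0, udev also adds /dev/rbd/rbd/iozone-test
  # put a filesystem on it, mount it, and run iozone with direct I/O
  mkfs.ext4 /dev/rbd/rbd/iozone-test
  mount /dev/rbd/rbd/iozone-test /mnt/iozone-test
  cd /mnt/iozone-test && iozone -I -s 128m -r 64k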
A couple of things:
1) Were you testing writes? If so, were your journals on the same
disks as the OSDs? Ceph currently writes a full copy of the data to the
journal, which means you are doing 2 writes for every 1 client write.
This is why some people use fast SSDs holding 3-4 journals each. On the
other hand, that means that losing that SSD takes out several OSDs and
causes more data replication, and you may lose read performance and
capacity if the SSD is taking up a slot that an OSD would have otherwise
occupied.
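Keeping the journal off the data disk is just a per-OSD setting in
ceph.conf; for example (section name, host, device and size below are
only illustrative):

  [osd.0]
      host = ceph-01
      # journal on a partition of a shared SSD instead of the data disk
      osd journal = /dev/sdg1
      osd journal size = 1000        # in MB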
Oh right, no I don't. I'll try to do a quick test with a ramdrive
(this is just test data, I can afford to lose it).
I guess that one SSD can stand the load of 4 HDDs, maybe more, but
you're right that if you go too far (e.g. 20+ OSDs) the SSD might become
your new bottleneck.
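For the record, moving an existing OSD's journal for such a test goes
roughly like this (the ramdisk path is only an example, and losing a
journal normally means losing the OSD, so this is only for expendable
test data):

  # stop the OSD and write out whatever is still in the old journal
  service ceph stop osd.0
  ceph-osd -i 0 --flush-journal
  # point "osd journal" in ceph.conf at a file on the ramdisk, e.g.
  #   osd journal = /mnt/ramdisk/osd.0.journal
  ceph-osd -i 0 --mkjournal
  service ceph start osd.0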
Ok, I did my homework and moved the journal away. It improved the
performance, but I'm still at ~1/3 of the performance of iozone run
directly on the hard drive.
I also briefly added a ramdisk to ceph and benchmarked it: I got much
better performance (~180 MB/s), but it's still far from what the same
ramdisk does when benchmarked directly on one of the OSD hosts.
I'm still convinced that something is not going well, because if I run:
rados -p rbd bench 300 write -t 4
I can reach 90% of the performance of the benchmark. The client is
running a stock Ubuntu 12.10 kernel:
Linux builder01 3.5.0-23-generic #35-Ubuntu SMP Thu Jan 24 13:15:40 UTC
2013 x86_64 x86_64 x86_64 GNU/Linux
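One way to narrow down where the gap comes from is to take the
filesystem out of the picture and write to the mapped rbd device
directly with direct I/O (a sketch; the image name is the example from
above, and the write is destructive, so use a scratch image):

  # sequential 4 MB direct writes against the raw mapped device,
  # roughly comparable to rados bench with 4 MB objects
  dd if=/dev/zero of=/dev/rbd/rbd/iozone-test bs=4M count=256 oflag=direct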
2) Are you factoring in any replication being done?
What do you mean?
To my understanding, when osd0 receives the data it should send it
directly to osd1, or does it write it first and then read it back?
I.e. if you are doing write tests, is the pool you are writing to set to
use 2x replication (the default)? For performance benchmarking I
usually set it to 1x replication (i.e. no replication) unless I am
explicitly testing a replication scenario.
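For example, for the rbd pool used in the tests above, dropping to a
single replica and restoring it afterwards is just:

  ceph osd pool get rbd size       # check the current replication level
  ceph osd pool set rbd size 1     # 1x, i.e. no replication
  ceph osd pool set rbd size 2     # put the default back after the test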
Yes, I was using 2x replication and doing the performance benchmarking
with that scenario. Of course it means that, since the write has to be
acknowledged by both OSDs, you'll get no more than min(throughput of
osd#1, throughput of osd#2), and even a bit less, since some time is
lost sending the request to the first OSD, which in turn spends some
time sending the data to the second OSD.
But none of this should cause the system to have a throughput that is
1/3 of the nominal capacity of the disks.
Could you reply with the tests you ran and the numbers you got? Seeing
the raw data might help.
With 2x replication, every OSD is sending a copy of the data it
receives to some other OSD. That means that every OSD is on average
now doing 2X the writes, and has 2X the amount of incoming data over
the network. Assuming your journals are not on the same disks, your
write speed is going to be at best half of what the disks actually do.
On the client (iozone throughput in KB/s):

iozone -I -s 128m -r 64k
                                                  random  random    bkwd  record  stride
      KB  reclen   write rewrite    read  reread    read   write    read rewrite    read   fwrite frewrite    fread  freread
  131072      64   13414   12220   49396   49821   50181    7589   49657   17670   49227  3601088  4993022  9871395  8526676

iozone -I -s 128m -r 4m
                                                  random  random    bkwd  record  stride
      KB  reclen   write rewrite    read  reread    read   write    read rewrite    read   fwrite frewrite    fread  freread
  131072    4096   24836   19362  109361  109685  109576   24486   74901   27755   58324  2224014  3665214  8260073  8877222
On ceph-01:

iozone -I -s 128m -r 64k
                                                  random  random    bkwd  record  stride
      KB  reclen   write rewrite    read  reread    read   write    read rewrite    read   fwrite frewrite    fread  freread
  131072      64   78883   78913   52972   51753   10356   23332   32123  188720   10993  1156601  1524817  3031048  3076885

iozone -I -s 128m -r 4m
                                                  random  random    bkwd  record  stride
      KB  reclen   write rewrite    read  reread    read   write    read rewrite    read   fwrite frewrite    fread  freread
  131072    4096   98334   97621   96955   96947   83943   83614   83680   83191   87006   767305   905712  1778933  1790281
On ceph-02:

iozone -I -s 128m -r 64k
                                                  random  random    bkwd  record  stride
      KB  reclen   write rewrite    read  reread    read   write    read rewrite    read   fwrite frewrite    fread  freread
  131072      64   48184   49862   44058   43570    9393   19302   17795  138314    7828  1178037  1540321  3055577  3097672

iozone -I -s 128m -r 4m
                                                  random  random    bkwd  record  stride
      KB  reclen   write rewrite    read  reread    read   write    read rewrite    read   fwrite frewrite    fread  freread
  131072    4096   62941   62394   63080   61404   56513   55951   57730   54787   55684   769129   913794  1768660  1782636
For the purpose of the test the journal is stored in a 4G ramdisk; the
journal size is 1000 (1G).
Regarding the performance drop, I'm not sure I understand why 2x
replication should result in 1/2 the performance. Is ceph writing to the
first replica and then to the second, or is it trying to write to both
at almost the same time? (There must be some delay, because the data
arrives only on replica #1 and has to be read from the network and then
resent to replica #2.) In the former case the performance impact is
quite logical; in the latter case the numbers should be closer to the
native performance of the slowest replica.
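To put rough numbers on that (using the 4 MB-record iozone write figures
above, and assuming the two OSD data disks are the limiting factor):

  per-disk write speed (4 MB records):  ceph-01 ~96 MB/s, ceph-02 ~61 MB/s
  with 2x replication every write lands on both OSDs, so the ceiling is
  roughly min(96, 61) ~= 60 MB/s for the whole cluster
  rados bench, 4 MB writes:             ~44-46 MB/s  (~75% of that ceiling)
  iozone on the rbd volume, 4 MB recs:  ~24 MB/s     (~40% of that ceiling)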
I also still see rados bench being faster:
rados -p rbd bench 300 write -t 1
2013-04-09 22:55:46.351989 min lat: 0.105228 max lat: 0.175903 avg lat: 0.109805
sec Cur ops started finished avg MB/s cur MB/s last lat avg lat
300 1 2733 2732 36.4224 36 0.110325 0.109805
Total time run: 300.212541
Total writes made: 2734
Write size: 4194304
Bandwidth (MB/sec): 36.428
Stddev Bandwidth: 2.47917
Max bandwidth (MB/sec): 40
Min bandwidth (MB/sec): 0
Average Latency: 0.109803
Stddev Latency: 0.00182862
Max latency: 0.175903
Min latency: 0.105228
rados -p rbd bench 300 write -t 4
2013-04-09 23:18:01.629757 min lat: 0.10903 max lat: 2.39524 avg lat: 0.344683
sec Cur ops started finished avg MB/s cur MB/s last lat avg lat
300 4 3484 3480 46.3947 68 0.301892 0.344683
Total time run: 300.130446
Total writes made: 3484
Write size: 4194304
Bandwidth (MB/sec): 46.433
Stddev Bandwidth: 15.839
Max bandwidth (MB/sec): 80
Min bandwidth (MB/sec): 0
Average Latency: 0.344564
Stddev Latency: 0.220953
Max latency: 2.39524
Min latency: 0.10903
rados -p rbd bench 300 write
2013-04-09 23:03:11.630579 min lat: 0.115311 max lat: 6.05309 avg lat: 1.44797
sec Cur ops started finished avg MB/s cur MB/s last lat avg lat
300 16 3322 3306 44.0749 48 1.80913 1.44797
Total time run: 300.886165
Total writes made: 3323
Write size: 4194304
Bandwidth (MB/sec): 44.176
Stddev Bandwidth: 19.4799
Max bandwidth (MB/sec): 100
Min bandwidth (MB/sec): 0
Average Latency: 1.44864
Stddev Latency: 0.842172
Max latency: 6.05309
Min latency: 0.115311
Matthieu.
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com