Re: Cluster Performance very Poor

German Anders <ganders@xxxxxxxxxxxx> · Fri, 27 Dec 2013 13:49:14 -0500

Hi Mark,
   Thanks a lot for the quick answer. So you're suggesting putting the OSD and the journal on the same disk; would that be:

ceph-deploy osd prepare ceph-node01:sda:sda
ceph-deploy osd activate ceph-node01:/dev/sda1:/dev/sda2

      And then take the 500GB OSDs out of the Ceph pool. Do you have the commands to make those moves?

Thanks a lot,

Best regards,

German Anders

--- Original message --- 
Subject: Re: Cluster Performance very Poor
From: Mark Nelson <mark.nelson@xxxxxxxxxxx>
To: <ceph-users@xxxxxxxxxxxxxx>
Date: Friday, 27/12/2013 15:39

On 12/27/2013 12:19 PM, German Anders wrote:
      Hi Cephers,

                I've run a rados bench to measure the throughput of the cluster,
 and found that the performance is really poor:

 The setup is the following:

 OS: Ubuntu 12.10 Server 64 bits

 ceph-node01 (mon)       10.77.0.101    ProLiant BL460c G7    32GB    8 x 2 GHz
                         10.1.1.151     D2200sb Storage Blade (Firmware: 2.30)
 ceph-node02 (mon)       10.77.0.102    ProLiant BL460c G7    64GB    8 x 2 GHz
                         10.1.1.152     D2200sb Storage Blade (Firmware: 2.30)
 ceph-node03 (mon)       10.77.0.103    ProLiant BL460c G6    32GB    8 x 2 GHz
                         10.1.1.153     D2200sb Storage Blade (Firmware: 2.30)
 ceph-node04             10.77.0.104    ProLiant BL460c G7    32GB    8 x 2 GHz
                         10.1.1.154     D2200sb Storage Blade (Firmware: 2.30)
 ceph-node05 (deploy)    10.77.0.105    ProLiant BL460c G6    32GB    8 x 2 GHz
                         10.1.1.155     D2200sb Storage Blade (Firmware: 2.30)

If your servers have controllers with writeback cache, please make sure 
it is enabled as that will likely help.

 ceph-node01:

            /dev/sda    73G    (OSD)
            /dev/sdb    73G    (OSD)
            /dev/sdc    73G    (OSD)
            /dev/sdd    73G    (OSD)
            /dev/sde    73G    (OSD)
            /dev/sdf    73G    (OSD)
            /dev/sdg    73G    (OSD)
            /dev/sdh    73G    (OSD)
            /dev/sdi    73G    (OSD)
            /dev/sdj    73G    (Journal)
            /dev/sdk    500G    (OSD)
            /dev/sdl    500G    (OSD)
            /dev/sdn    146G    (Journal)

 ceph-node02:

            /dev/sda    73G    (OSD)
            /dev/sdb    73G    (OSD)
            /dev/sdc    73G    (OSD)
            /dev/sdd    73G    (OSD)
            /dev/sde    73G    (OSD)
            /dev/sdf    73G    (OSD)
            /dev/sdg    73G    (OSD)
            /dev/sdh    73G    (OSD)
            /dev/sdi    73G    (OSD)
            /dev/sdj    73G    (Journal)
            /dev/sdk    500G    (OSD)
            /dev/sdl    500G    (OSD)
            /dev/sdn    146G    (Journal)

 ceph-node03:

            /dev/sda    73G    (OSD)
            /dev/sdb    73G    (OSD)
            /dev/sdc    73G    (OSD)
            /dev/sdd    73G    (OSD)
            /dev/sde    73G    (OSD)
            /dev/sdf    73G    (OSD)
            /dev/sdg    73G    (OSD)
            /dev/sdh    73G    (OSD)
            /dev/sdi    73G    (OSD)
            /dev/sdj    73G    (Journal)
            /dev/sdk    500G    (OSD)
            /dev/sdl    500G    (OSD)
            /dev/sdn    73G    (Journal)

 ceph-node04:

            /dev/sda    73G    (OSD)
            /dev/sdb    73G    (OSD)
            /dev/sdc    73G    (OSD)
            /dev/sdd    73G    (OSD)
            /dev/sde    73G    (OSD)
            /dev/sdf    73G    (OSD)
            /dev/sdg    73G    (OSD)
            /dev/sdh    73G    (OSD)
            /dev/sdi    73G    (OSD)
            /dev/sdj    73G    (Journal)
            /dev/sdk    500G    (OSD)
            /dev/sdl    500G    (OSD)
            /dev/sdn    146G    (Journal)

 ceph-node05:

            /dev/sda    73G    (OSD)
            /dev/sdb    73G    (OSD)
            /dev/sdc    73G    (OSD)
            /dev/sdd    73G    (OSD)
            /dev/sde    73G    (OSD)
            /dev/sdf    73G    (OSD)
            /dev/sdg    73G    (OSD)
            /dev/sdh    73G    (OSD)
            /dev/sdi    73G    (OSD)
            /dev/sdj    73G    (Journal)
            /dev/sdk    500G    (OSD)
            /dev/sdl    500G    (OSD)
            /dev/sdn    73G    (Journal)

Am I correct in assuming that you've put all of your journals for every 
disk in each node on two spinning disks?  This is going to be quite 
slow, because Ceph does a full write of the data to the journal for every 
real write.  The general solution is to either use SSDs for journals 
(preferably multiple fast SSDs with high write endurance and only 3-6 
OSD journals each), or put the journals on a partition on the data disk.
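
To make the journal bottleneck concrete, here is a rough back-of-the-envelope sketch in Python. The per-disk throughput figure (100 MB/s) is an assumed placeholder, not a measurement from this cluster; the counts of 11 OSDs and 2 journal disks per node come from the disk listing above. Seek overhead, replication and network limits are all ignored, so treat the numbers as an illustration of the ratio only.

```python
# Rough per-node write ceiling under two journal layouts.
# ASSUMPTION: every disk sustains about 100 MB/s of sequential writes.
# This is a placeholder figure, not a measurement from this cluster.
DISK_MBPS = 100.0
OSDS_PER_NODE = 11          # 9 x 73G + 2 x 500G, per the disk listing
JOURNAL_DISKS_PER_NODE = 2  # sdj and sdn


def shared_journal_ceiling(journal_disks, disk_mbps):
    """Every write goes through a journal first, so the journal disks
    cap the write throughput of the whole node."""
    return journal_disks * disk_mbps


def colocated_journal_ceiling(osds, disk_mbps):
    """With the journal on the data disk, each disk writes the data twice
    (journal, then data), so roughly half its bandwidth is usable."""
    return osds * disk_mbps / 2.0


shared = shared_journal_ceiling(JOURNAL_DISKS_PER_NODE, DISK_MBPS)
colocated = colocated_journal_ceiling(OSDS_PER_NODE, DISK_MBPS)
print(f"shared journals:     ~{shared:.0f} MB/s per node")
print(f"co-located journals: ~{colocated:.0f} MB/s per node")
```

Under these assumptions, two shared journal disks limit a node to about 2 disks' worth of write bandwidth, while co-located journals allow about 5.5 disks' worth, even though each disk then writes everything twice.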

 And the OSD tree is:

 root@ceph-node03:/home/ceph# ceph osd tree
 # id    weight    type name    up/down    reweight
 -1    7.27    root default
 -2    1.15        host ceph-node01
 12    0.06999            osd.12    up    1
 13    0.06999            osd.13    up    1
 14    0.06999            osd.14    up    1
 15    0.06999            osd.15    up    1
 16    0.06999            osd.16    up    1
 17    0.06999            osd.17    up    1
 18    0.06999            osd.18    up    1
 19    0.06999            osd.19    up    1
 20    0.06999            osd.20    up    1
 21    0.45            osd.21    up    1
 22    0.06999            osd.22    up    1
 -3    1.53        host ceph-node02
 23    0.06999            osd.23    up    1
 24    0.06999            osd.24    up    1
 25    0.06999            osd.25    up    1
 26    0.06999            osd.26    up    1
 27    0.06999            osd.27    up    1
 28    0.06999            osd.28    up    1
 29    0.06999            osd.29    up    1
 30    0.06999            osd.30    up    1
 31    0.06999            osd.31    up    1
 32    0.45            osd.32    up    1
 33    0.45            osd.33    up    1
 -4    1.53        host ceph-node03
 34    0.06999            osd.34    up    1
 35    0.06999            osd.35    up    1
 36    0.06999            osd.36    up    1
 37    0.06999            osd.37    up    1
 38    0.06999            osd.38    up    1
 39    0.06999            osd.39    up    1
 40    0.06999            osd.40    up    1
 41    0.06999            osd.41    up    1
 42    0.06999            osd.42    up    1
 43    0.45            osd.43    up    1
 44    0.45            osd.44    up    1
 -5    1.53        host ceph-node04
 0    0.06999            osd.0    up    1
 1    0.06999            osd.1    up    1
 2    0.06999            osd.2    up    1
 3    0.06999            osd.3    up    1
 4    0.06999            osd.4    up    1
 5    0.06999            osd.5    up    1
 6    0.06999            osd.6    up    1
 7    0.06999            osd.7    up    1
 8    0.06999            osd.8    up    1
 9    0.45            osd.9    up    1
 10    0.45            osd.10    up    1
 -6    1.53        host ceph-node05
 11    0.06999            osd.11    up    1
 45    0.06999            osd.45    up    1
 46    0.06999            osd.46    up    1
 47    0.06999            osd.47    up    1
 48    0.06999            osd.48    up    1
 49    0.06999            osd.49    up    1
 50    0.06999            osd.50    up    1
 51    0.06999            osd.51    up    1
 52    0.06999            osd.52    up    1
 53    0.45            osd.53    up    1
 54    0.45            osd.54    up    1

Based on this, it appears your 500GB drives are weighted much higher 
than the 73GB drives.  This will help even data distribution out, but 
unfortunately will cause the system to be slower if all of the OSDs are 
in the same pool.  What this does is cause the 500GB drives to get a 
higher proportion of the writes than the other drives, but those drives 
are almost certainly no faster than the other ones.  Because there is a 
limited number of outstanding IOs you can have (due to memory 
constraints), eventually all outstanding IOs will be waiting on the 
500GB disks while the 73GB disks mostly sit around waiting for work.
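
The effect of the weights can be estimated with a short Python sketch. It assumes that the share of writes an OSD receives is proportional to its CRUSH weight, which is a simplification; the weights and OSD counts (46 OSDs at 0.06999 and 9 OSDs at 0.45) are taken from the osd tree output above.

```python
# Expected share of writes per OSD, ASSUMING placement is simply
# proportional to CRUSH weight (a simplification of CRUSH behaviour).
SMALL_WEIGHT = 0.06999   # 73GB disks, from the osd tree
LARGE_WEIGHT = 0.45      # 500GB disks, from the osd tree
N_SMALL = 46             # OSDs with weight 0.06999 in the osd tree
N_LARGE = 9              # OSDs with weight 0.45 in the osd tree

total = N_SMALL * SMALL_WEIGHT + N_LARGE * LARGE_WEIGHT
large_share = N_LARGE * LARGE_WEIGHT / total
per_disk_ratio = LARGE_WEIGHT / SMALL_WEIGHT

print(f"total weight:                {total:.2f}")
print(f"share going to 500GB disks:  {large_share:.1%}")
print(f"load per 500GB vs 73GB disk: {per_disk_ratio:.1f}x")
```

Under that assumption, the 9 large disks (out of 55 OSDs) absorb a bit more than half of all writes, and each one sees roughly six times the load of a 73GB disk.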

What I'd suggest doing is putting all of your 73GB disks in the same pool 
and your 500GB disks in another pool.  I suspect that if you do that and 
put your journals on the first partition of each disk, you'll see some 
improvement in your benchmark results.

 And the result:

 root@ceph-node03:/home/ceph# rados bench -p ceph-cloud 20 write -t 10
      Maintaining 10 concurrent writes of 4194304 bytes for up to 20 seconds or 0 objects
      Object prefix: benchmark_data_ceph-node03_29727
          sec Cur ops   started  finished  avg MB/s  cur MB/s  last lat   avg lat
              0       0         0         0         0         0         -         0
              1      10        30        20   79.9465        80  0.159295  0.378849
              2      10        52        42   83.9604        88  0.719616  0.430293
              3      10        74        64   85.2991        88  0.487685  0.412956
              4      10        97        87   86.9676        92  0.351122  0.418814
              5      10       123       113   90.3679       104  0.317011  0.418876
              6      10       147       137   91.3012        96  0.562112  0.418178
              7      10       172       162   92.5398       100  0.691045  0.413416
              8      10       197       187    93.469       100  0.459424  0.415459
              9      10       222       212   94.1915       100  0.798889  0.416093
            10      10       248       238   95.1697       104  0.440002  0.415609
            11      10       267       257   93.4252        76   0.48959   0.41531
            12      10       289       279   92.9707        88  0.524622  0.420145
            13      10       313       303   93.2016        96   1.02104  0.423955
            14      10       336       326   93.1136        92  0.477328  0.420684
            15      10       359       349    93.037        92  0.591118  0.418589
            16      10       383       373   93.2204        96  0.600392  0.421916
            17      10       407       397   93.3812        96  0.240166  0.419829
            18      10       431       421    93.526        96  0.746706  0.420971
            19      10       457       447   94.0757       104  0.237565  0.419025
 2013-12-27 13:13:21.817874 min lat: 0.101352 max lat: 1.81426 avg lat: 0.418242
          sec Cur ops   started  finished  avg MB/s  cur MB/s  last lat   avg lat
            20      10       480       470   93.9709        92  0.489254  0.418242
      Total time run:         20.258064
 Total writes made:      481
 Write size:             4194304
 Bandwidth (MB/sec):     94.975

 Stddev Bandwidth:       21.7799
 Max bandwidth (MB/sec): 104
 Min bandwidth (MB/sec): 0
 Average Latency:        0.420573
 Stddev Latency:         0.226378
 Max latency:            1.81426
 Min latency:            0.101352
 root@ceph-node03:/home/ceph#
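
As a sanity check on the summary above, the reported bandwidth can be reproduced from the other figures in the output. The Python sketch below only uses numbers printed by rados bench; the second estimate applies Little's law (throughput = concurrency / average latency), which is my own cross-check and not something rados bench reports.

```python
# Cross-check the rados bench summary using only figures from its output.
WRITES = 481               # Total writes made
WRITE_BYTES = 4194304      # Write size
RUNTIME_S = 20.258064      # Total time run
CONCURRENCY = 10           # -t 10
AVG_LATENCY_S = 0.420573   # Average Latency

mb_per_write = WRITE_BYTES / (1024 * 1024)

# Bandwidth as total data written divided by elapsed time.
bandwidth = WRITES * mb_per_write / RUNTIME_S

# Little's law estimate: ops in flight / average latency = ops per second.
littles_law = CONCURRENCY / AVG_LATENCY_S * mb_per_write

print(f"bandwidth from totals:  {bandwidth:.3f} MB/s")
print(f"Little's law estimate:  {littles_law:.1f} MB/s")
```

Both estimates land at about 95 MB/s, which suggests the result is bounded by per-operation latency at 10 concurrent writes rather than by a reporting quirk.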

 Thanks in advance,

 Best regards,

 *German Anders*

 _______________________________________________
 ceph-users mailing list
 ceph-users@xxxxxxxxxxxxxx
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
