Re: Cluster Performance very Poor

Mark Nelson <mark.nelson@xxxxxxxxxxx> · Fri, 27 Dec 2013 12:39:09 -0600

On 12/27/2013 12:19 PM, German Anders wrote:
  Hi Cephers,

       I've run a rados bench to measure the throughput of the cluster,
and found that the performance is really poor:

The setup is the following:

OS: Ubuntu 12.10 Server 64 bits

ceph-node01(mon)       10.77.0.101    ProLiant BL460c G7    32GB    8 x 2 GHz
                       10.1.1.151     D2200sb Storage Blade (Firmware: 2.30)
ceph-node02(mon)       10.77.0.102    ProLiant BL460c G7    64GB    8 x 2 GHz
                       10.1.1.152     D2200sb Storage Blade (Firmware: 2.30)
ceph-node03(mon)       10.77.0.103    ProLiant BL460c G6    32GB    8 x 2 GHz
                       10.1.1.153     D2200sb Storage Blade (Firmware: 2.30)
ceph-node04            10.77.0.104    ProLiant BL460c G7    32GB    8 x 2 GHz
                       10.1.1.154     D2200sb Storage Blade (Firmware: 2.30)
ceph-node05(deploy)    10.77.0.105    ProLiant BL460c G6    32GB    8 x 2 GHz
                       10.1.1.155     D2200sb Storage Blade (Firmware: 2.30)

If your servers have controllers with writeback cache, please make sure
it is enabled, as that will likely help.

ceph-node01:

     /dev/sda    73G    (OSD)
     /dev/sdb    73G    (OSD)
     /dev/sdc    73G    (OSD)
     /dev/sdd    73G    (OSD)
     /dev/sde    73G    (OSD)
     /dev/sdf    73G    (OSD)
     /dev/sdg    73G    (OSD)
     /dev/sdh    73G    (OSD)
     /dev/sdi    73G    (OSD)
     /dev/sdj    73G    (Journal)
     /dev/sdk    500G    (OSD)
     /dev/sdl    500G    (OSD)
     /dev/sdn    146G    (Journal)

ceph-node02:

     /dev/sda    73G    (OSD)
     /dev/sdb    73G    (OSD)
     /dev/sdc    73G    (OSD)
     /dev/sdd    73G    (OSD)
     /dev/sde    73G    (OSD)
     /dev/sdf    73G    (OSD)
     /dev/sdg    73G    (OSD)
     /dev/sdh    73G    (OSD)
     /dev/sdi    73G    (OSD)
     /dev/sdj    73G    (Journal)
     /dev/sdk    500G    (OSD)
     /dev/sdl    500G    (OSD)
     /dev/sdn    146G    (Journal)

ceph-node03:

     /dev/sda    73G    (OSD)
     /dev/sdb    73G    (OSD)
     /dev/sdc    73G    (OSD)
     /dev/sdd    73G    (OSD)
     /dev/sde    73G    (OSD)
     /dev/sdf    73G    (OSD)
     /dev/sdg    73G    (OSD)
     /dev/sdh    73G    (OSD)
     /dev/sdi    73G    (OSD)
     /dev/sdj    73G    (Journal)
     /dev/sdk    500G    (OSD)
     /dev/sdl    500G    (OSD)
     /dev/sdn    73G    (Journal)

ceph-node04:

     /dev/sda    73G    (OSD)
     /dev/sdb    73G    (OSD)
     /dev/sdc    73G    (OSD)
     /dev/sdd    73G    (OSD)
     /dev/sde    73G    (OSD)
     /dev/sdf    73G     (OSD)
     /dev/sdg    73G    (OSD)
     /dev/sdh    73G    (OSD)
     /dev/sdi    73G    (OSD)
     /dev/sdj    73G    (Journal)
     /dev/sdk    500G    (OSD)
     /dev/sdl    500G    (OSD)
     /dev/sdn    146G    (Journal)

ceph-node05:

     /dev/sda    73G    (OSD)
     /dev/sdb    73G    (OSD)
     /dev/sdc    73G    (OSD)
     /dev/sdd    73G    (OSD)
     /dev/sde    73G    (OSD)
     /dev/sdf    73G    (OSD)
     /dev/sdg    73G    (OSD)
     /dev/sdh    73G    (OSD)
     /dev/sdi    73G    (OSD)
     /dev/sdj    73G    (Journal)
     /dev/sdk    500G    (OSD)
     /dev/sdl    500G    (OSD)
     /dev/sdn    73G    (Journal)

Am I correct in assuming that you've put all of the journals for every
disk in each node on two spinning disks?  This is going to be quite
slow, because Ceph does a full write of the data to the journal for every
real write.  The general solution is to either use SSDs for journals
(preferably multiple fast SSDs with high write endurance and only 3-6
OSD journals each), or put the journals on a partition on the data disk.

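To put rough numbers on the journal bottleneck described above, here is a back-of-the-envelope sketch. The per-disk throughput (100 MB/s) and the replica count (2) are assumptions chosen for illustration and do not come from this thread; the node, OSD, and journal disk counts are read off the disk listing.

```python
# Rough ceiling on client write throughput imposed by the journal devices.
# ASSUMPTIONS (not from the thread): each spinning disk sustains about
# 100 MB/s of sequential writes, and the pool keeps 2 replicas per object.

def journal_bound_client_mb_s(nodes, journal_devs_per_node, dev_mb_s, replicas):
    """Every replica write passes through a journal first, so the combined
    journal bandwidth divided by the replica count caps client throughput."""
    total_journal_mb_s = nodes * journal_devs_per_node * dev_mb_s
    return total_journal_mb_s / replicas

# Layout from the listing: 5 nodes, each with 11 OSDs and 2 journal disks.
shared = journal_bound_client_mb_s(nodes=5, journal_devs_per_node=2,
                                   dev_mb_s=100.0, replicas=2)

# Journal on a partition of each data disk: every OSD disk writes the data
# twice (journal, then data), so it contributes about half its bandwidth.
colocated = journal_bound_client_mb_s(nodes=5, journal_devs_per_node=11,
                                      dev_mb_s=100.0 / 2, replicas=2)

print(f"two shared journal disks per node: <= {shared:.0f} MB/s")
print(f"journal on each data disk:         <= {colocated:.0f} MB/s")
```

These are upper bounds only. Real numbers will be lower, in particular because five or six journal streams sharing one spinning disk turn sequential writes into seek-heavy ones.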
And the OSD tree is:

root@ceph-node03:/home/ceph# ceph osd tree
# id    weight    type name    up/down    reweight
-1    7.27    root default
-2    1.15        host ceph-node01
12    0.06999            osd.12    up    1
13    0.06999            osd.13    up    1
14    0.06999            osd.14    up    1
15    0.06999            osd.15    up    1
16    0.06999            osd.16    up    1
17    0.06999            osd.17    up    1
18    0.06999            osd.18    up    1
19    0.06999            osd.19    up    1
20    0.06999            osd.20    up    1
21    0.45            osd.21    up    1
22    0.06999            osd.22    up    1
-3    1.53        host ceph-node02
23    0.06999            osd.23    up    1
24    0.06999            osd.24    up    1
25    0.06999            osd.25    up    1
26    0.06999            osd.26    up    1
27    0.06999            osd.27    up    1
28    0.06999            osd.28    up    1
29    0.06999            osd.29    up    1
30    0.06999            osd.30    up    1
31    0.06999            osd.31    up    1
32    0.45            osd.32    up    1
33    0.45            osd.33    up    1
-4    1.53        host ceph-node03
34    0.06999            osd.34    up    1
35    0.06999            osd.35    up    1
36    0.06999            osd.36    up    1
37    0.06999            osd.37    up    1
38    0.06999            osd.38    up    1
39    0.06999            osd.39    up    1
40    0.06999            osd.40    up    1
41    0.06999            osd.41    up    1
42    0.06999            osd.42    up    1
43    0.45            osd.43    up    1
44    0.45            osd.44    up    1
-5    1.53        host ceph-node04
0    0.06999            osd.0    up    1
1    0.06999            osd.1    up    1
2    0.06999            osd.2    up    1
3    0.06999            osd.3    up    1
4    0.06999            osd.4    up    1
5    0.06999            osd.5    up    1
6    0.06999            osd.6    up    1
7    0.06999            osd.7    up    1
8    0.06999            osd.8    up    1
9    0.45            osd.9    up    1
10    0.45            osd.10    up    1
-6    1.53        host ceph-node05
11    0.06999            osd.11    up    1
45    0.06999            osd.45    up    1
46    0.06999            osd.46    up    1
47    0.06999            osd.47    up    1
48    0.06999            osd.48    up    1
49    0.06999            osd.49    up    1
50    0.06999            osd.50    up    1
51    0.06999            osd.51    up    1
52    0.06999            osd.52    up    1
53    0.45            osd.53    up    1
54    0.45            osd.54    up    1

Based on this, it appears your 500GB drives are weighted much higher
than the 73GB drives.  This will help even out data distribution, but
unfortunately will cause the system to be slower if all of the OSDs are
in the same pool.  What this does is cause the 500GB drives to get a
higher proportion of the writes than the other drives, but those drives
are almost certainly no faster than the other ones.  Because there is a
limited number of outstanding IOs you can have (due to memory
constraints), eventually all outstanding IOs will be waiting on the
500GB disks while the 73GB disks mostly sit around waiting for work.

What I'd suggest doing is putting all of your 73GB disks in one pool
and your 500GB disks in another pool.  I suspect that if you do that and
put your journals on the first partition of each disk, you'll see some
improvement in your benchmark results.
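
The imbalance can be estimated from the weights in the osd tree. The sketch below assumes CRUSH places data roughly in proportion to weight, which is a simplification that ignores per-host replica placement; the counts and weights are taken from the tree output.

```python
# Estimated share of writes landing on the 500GB OSDs.
# ASSUMPTION: data placement is proportional to CRUSH weight.

SMALL_WEIGHT = 0.06999   # weight of each 73GB OSD in the osd tree
LARGE_WEIGHT = 0.45      # weight of each 500GB OSD in the osd tree
SMALL_COUNT = 46         # OSDs listed at weight 0.06999
LARGE_COUNT = 9          # OSDs listed at weight 0.45

total_weight = SMALL_COUNT * SMALL_WEIGHT + LARGE_COUNT * LARGE_WEIGHT
large_share = LARGE_COUNT * LARGE_WEIGHT / total_weight
per_disk_ratio = LARGE_WEIGHT / SMALL_WEIGHT

print(f"total weight:                  {total_weight:.2f}")
print(f"share of writes on 500GB OSDs: {large_share:.1%}")
print(f"writes per 500GB OSD relative to a 73GB OSD: {per_disk_ratio:.2f}x")
```

The total comes out at 7.27, matching the root weight in the tree. Under the proportionality assumption, 9 of the 55 OSDs would take a bit over half of all writes, each one seeing roughly 6.4 times the write load of a 73GB OSD.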

And the result:

root@ceph-node03:/home/ceph# rados bench -p ceph-cloud 20 write -t 10
  Maintaining 10 concurrent writes of 4194304 bytes for up to 20 seconds or 0 objects
  Object prefix: benchmark_data_ceph-node03_29727
    sec Cur ops   started  finished  avg MB/s  cur MB/s  last lat   avg lat
      0       0         0         0         0         0         -         0
      1      10        30        20   79.9465        80  0.159295  0.378849
      2      10        52        42   83.9604        88  0.719616  0.430293
      3      10        74        64   85.2991        88  0.487685  0.412956
      4      10        97        87   86.9676        92  0.351122  0.418814
      5      10       123       113   90.3679       104  0.317011  0.418876
      6      10       147       137   91.3012        96  0.562112  0.418178
      7      10       172       162   92.5398       100  0.691045  0.413416
      8      10       197       187    93.469       100  0.459424  0.415459
      9      10       222       212   94.1915       100  0.798889  0.416093
     10      10       248       238   95.1697       104  0.440002  0.415609
     11      10       267       257   93.4252        76   0.48959   0.41531
     12      10       289       279   92.9707        88  0.524622  0.420145
     13      10       313       303   93.2016        96   1.02104  0.423955
     14      10       336       326   93.1136        92  0.477328  0.420684
     15      10       359       349    93.037        92  0.591118  0.418589
     16      10       383       373   93.2204        96  0.600392  0.421916
     17      10       407       397   93.3812        96  0.240166  0.419829
     18      10       431       421    93.526        96  0.746706  0.420971
     19      10       457       447   94.0757       104  0.237565  0.419025
2013-12-27 13:13:21.817874min lat: 0.101352 max lat: 1.81426 avg lat: 0.418242
    sec Cur ops   started  finished  avg MB/s  cur MB/s  last lat   avg lat
     20      10       480       470   93.9709        92  0.489254  0.418242
  Total time run:         20.258064
Total writes made:      481
Write size:             4194304
Bandwidth (MB/sec):     94.975

Stddev Bandwidth:       21.7799
Max bandwidth (MB/sec): 104
Min bandwidth (MB/sec): 0
Average Latency:        0.420573
Stddev Latency:         0.226378
Max latency:            1.81426
Min latency:            0.101352
root@ceph-node03:/home/ceph#
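
As a sanity check on the summary above, the reported bandwidth can be reproduced from the totals, and it also lines up with what the concurrency and average latency imply. This is only an illustrative calculation; the variable names are made up here and all figures are copied from the bench output.

```python
# Figures copied from the rados bench summary.
writes = 481
write_size_bytes = 4194304
run_seconds = 20.258064
concurrency = 10             # from "-t 10"
avg_latency_s = 0.420573

mb_per_write = write_size_bytes / (1024 * 1024)   # 4 MB per object

# Bandwidth from the totals: data written divided by elapsed time.
bandwidth = writes * mb_per_write / run_seconds

# Little's law: throughput is about (ops in flight) / (average latency).
latency_bound = concurrency / avg_latency_s * mb_per_write

print(f"bandwidth from totals:       {bandwidth:.3f} MB/s")
print(f"concurrency / avg latency:   {latency_bound:.1f} MB/s")
```

Both figures land at about 95 MB/s, which suggests that with 10 writes in flight the result is set by the roughly 0.42 s each 4 MB write takes. Lower per-write latency, or a higher -t value if the disks can absorb it, would be the levers for raising it.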

Thanks in advance,

Best regards,

*German Anders*

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
