On 12/27/2013 12:19 PM, German Anders wrote:
Hi Cephers,

I've run a rados bench to measure the throughput of the cluster, and found that the performance is really poor:

The setup is the following:

OS: Ubuntu 12.10 Server 64 bits

ceph-node01(mon)     10.77.0.101  ProLiant BL460c G7  32GB  8 x 2 Ghz  10.1.1.151  D2200sb Storage Blade (Firmware: 2.30)
ceph-node02(mon)     10.77.0.102  ProLiant BL460c G7  64GB  8 x 2 Ghz  10.1.1.152  D2200sb Storage Blade (Firmware: 2.30)
ceph-node03(mon)     10.77.0.103  ProLiant BL460c G6  32GB  8 x 2 Ghz  10.1.1.153  D2200sb Storage Blade (Firmware: 2.30)
ceph-node04          10.77.0.104  ProLiant BL460c G7  32GB  8 x 2 Ghz  10.1.1.154  D2200sb Storage Blade (Firmware: 2.30)
ceph-node05(deploy)  10.77.0.105  ProLiant BL460c G6  32GB  8 x 2 Ghz  10.1.1.155  D2200sb Storage Blade (Firmware: 2.30)
If your servers have controllers with writeback cache, please make sure it is enabled as that will likely help.
ceph-node01:
    /dev/sda   73G  (OSD)
    /dev/sdb   73G  (OSD)
    /dev/sdc   73G  (OSD)
    /dev/sdd   73G  (OSD)
    /dev/sde   73G  (OSD)
    /dev/sdf   73G  (OSD)
    /dev/sdg   73G  (OSD)
    /dev/sdh   73G  (OSD)
    /dev/sdi   73G  (OSD)
    /dev/sdj   73G  (Journal)
    /dev/sdk  500G  (OSD)
    /dev/sdl  500G  (OSD)
    /dev/sdn  146G  (Journal)

ceph-node02:
    /dev/sda   73G  (OSD)
    /dev/sdb   73G  (OSD)
    /dev/sdc   73G  (OSD)
    /dev/sdd   73G  (OSD)
    /dev/sde   73G  (OSD)
    /dev/sdf   73G  (OSD)
    /dev/sdg   73G  (OSD)
    /dev/sdh   73G  (OSD)
    /dev/sdi   73G  (OSD)
    /dev/sdj   73G  (Journal)
    /dev/sdk  500G  (OSD)
    /dev/sdl  500G  (OSD)
    /dev/sdn  146G  (Journal)

ceph-node03:
    /dev/sda   73G  (OSD)
    /dev/sdb   73G  (OSD)
    /dev/sdc   73G  (OSD)
    /dev/sdd   73G  (OSD)
    /dev/sde   73G  (OSD)
    /dev/sdf   73G  (OSD)
    /dev/sdg   73G  (OSD)
    /dev/sdh   73G  (OSD)
    /dev/sdi   73G  (OSD)
    /dev/sdj   73G  (Journal)
    /dev/sdk  500G  (OSD)
    /dev/sdl  500G  (OSD)
    /dev/sdn   73G  (Journal)

ceph-node04:
    /dev/sda   73G  (OSD)
    /dev/sdb   73G  (OSD)
    /dev/sdc   73G  (OSD)
    /dev/sdd   73G  (OSD)
    /dev/sde   73G  (OSD)
    /dev/sdf   73G  (OSD)
    /dev/sdg   73G  (OSD)
    /dev/sdh   73G  (OSD)
    /dev/sdi   73G  (OSD)
    /dev/sdj   73G  (Journal)
    /dev/sdk  500G  (OSD)
    /dev/sdl  500G  (OSD)
    /dev/sdn  146G  (Journal)

ceph-node05:
    /dev/sda   73G  (OSD)
    /dev/sdb   73G  (OSD)
    /dev/sdc   73G  (OSD)
    /dev/sdd   73G  (OSD)
    /dev/sde   73G  (OSD)
    /dev/sdf   73G  (OSD)
    /dev/sdg   73G  (OSD)
    /dev/sdh   73G  (OSD)
    /dev/sdi   73G  (OSD)
    /dev/sdj   73G  (Journal)
    /dev/sdk  500G  (OSD)
    /dev/sdl  500G  (OSD)
    /dev/sdn   73G  (Journal)
Am I correct in assuming that you've put all of your journals for every disk in each node on two spinning disks? This is going to be quite slow, because Ceph does a full write of the data to the journal for every real write. The general solution is to either use SSDs for journals (preferably multiple fast SSDs with high write endurance and only 3-6 OSD journals each), or put the journals on a partition on the data disk.
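As a rough back-of-the-envelope sketch of why that matters, the snippet below compares the per-node write ceiling for the two layouts. The disk counts (11 OSD disks and 2 journal disks per node) come from the listing above. The 100 MB/s per-disk figure is purely an assumption for illustration, not a measurement, and the function names are hypothetical. It also ignores seek overhead from several journals interleaving on one spindle, so the shared-journal number is optimistic.

```python
def shared_journal_ceiling(osd_disks, journal_disks, disk_mb_s):
    """Per-node write ceiling when all journals live on a few dedicated disks.

    Every write goes to a journal disk first, so the slower of the two
    stages (journal disks vs. data disks) bounds the node.
    """
    return min(journal_disks * disk_mb_s, osd_disks * disk_mb_s)


def colocated_journal_ceiling(osd_disks, disk_mb_s):
    """Per-node write ceiling when each journal is a partition on its data disk.

    Each disk then writes every byte twice (journal, then data), which
    halves its useful bandwidth.
    """
    return osd_disks * disk_mb_s / 2


DISK_MB_S = 100.0  # ASSUMED sequential write speed of one spinning disk

shared = shared_journal_ceiling(osd_disks=11, journal_disks=2, disk_mb_s=DISK_MB_S)
colocated = colocated_journal_ceiling(osd_disks=11, disk_mb_s=DISK_MB_S)

print(f"two shared journal disks: {shared:.0f} MB/s per node")
print(f"journal on each data disk: {colocated:.0f} MB/s per node")
```

Under these assumptions the two journal disks cap a node well below what the eleven data disks could absorb, even after paying the double-write penalty on each data disk.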
And the OSD tree is:

root@ceph-node03:/home/ceph# ceph osd tree
# id    weight   type name            up/down  reweight
-1      7.27     root default
-2      1.15       host ceph-node01
12      0.06999      osd.12           up       1
13      0.06999      osd.13           up       1
14      0.06999      osd.14           up       1
15      0.06999      osd.15           up       1
16      0.06999      osd.16           up       1
17      0.06999      osd.17           up       1
18      0.06999      osd.18           up       1
19      0.06999      osd.19           up       1
20      0.06999      osd.20           up       1
21      0.45         osd.21           up       1
22      0.06999      osd.22           up       1
-3      1.53       host ceph-node02
23      0.06999      osd.23           up       1
24      0.06999      osd.24           up       1
25      0.06999      osd.25           up       1
26      0.06999      osd.26           up       1
27      0.06999      osd.27           up       1
28      0.06999      osd.28           up       1
29      0.06999      osd.29           up       1
30      0.06999      osd.30           up       1
31      0.06999      osd.31           up       1
32      0.45         osd.32           up       1
33      0.45         osd.33           up       1
-4      1.53       host ceph-node03
34      0.06999      osd.34           up       1
35      0.06999      osd.35           up       1
36      0.06999      osd.36           up       1
37      0.06999      osd.37           up       1
38      0.06999      osd.38           up       1
39      0.06999      osd.39           up       1
40      0.06999      osd.40           up       1
41      0.06999      osd.41           up       1
42      0.06999      osd.42           up       1
43      0.45         osd.43           up       1
44      0.45         osd.44           up       1
-5      1.53       host ceph-node04
0       0.06999      osd.0            up       1
1       0.06999      osd.1            up       1
2       0.06999      osd.2            up       1
3       0.06999      osd.3            up       1
4       0.06999      osd.4            up       1
5       0.06999      osd.5            up       1
6       0.06999      osd.6            up       1
7       0.06999      osd.7            up       1
8       0.06999      osd.8            up       1
9       0.45         osd.9            up       1
10      0.45         osd.10           up       1
-6      1.53       host ceph-node05
11      0.06999      osd.11           up       1
45      0.06999      osd.45           up       1
46      0.06999      osd.46           up       1
47      0.06999      osd.47           up       1
48      0.06999      osd.48           up       1
49      0.06999      osd.49           up       1
50      0.06999      osd.50           up       1
51      0.06999      osd.51           up       1
52      0.06999      osd.52           up       1
53      0.45         osd.53           up       1
54      0.45         osd.54           up       1
Based on this, it appears your 500GB drives are weighted much higher than the 73GB drives. This will help even data distribution out, but unfortunately will cause the system to be slower if all of the OSDs are in the same pool. What this does is cause the 500GB drives to get a higher proportion of the writes than the other drives, but those drives are almost certainly no faster than the other ones. Because there is a limited number of outstanding IOs you can have (due to memory constraints), eventually all outstanding IOs will be waiting on the 500GB disks while the 73GB disks mostly sit around waiting for work.
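To put a number on the imbalance, here is a small sketch that takes the weights from the `ceph osd tree` output above and assumes, as a simplification, that each OSD's share of writes is proportional to its CRUSH weight. Real placement is per placement group and replicas are spread across hosts, so treat the result as an estimate only.

```python
# Weights copied from the `ceph osd tree` output above.
SMALL, BIG = 0.06999, 0.45  # 73GB and 500GB OSD weights

hosts = {
    "ceph-node01": [SMALL] * 10 + [BIG],      # only one OSD weighted 0.45 here
    "ceph-node02": [SMALL] * 9 + [BIG] * 2,
    "ceph-node03": [SMALL] * 9 + [BIG] * 2,
    "ceph-node04": [SMALL] * 9 + [BIG] * 2,
    "ceph-node05": [SMALL] * 9 + [BIG] * 2,
}

all_weights = [w for ws in hosts.values() for w in ws]
total = sum(all_weights)

big_count = sum(1 for w in all_weights if w == BIG)
# ASSUMPTION: expected write share is proportional to CRUSH weight.
big_share = big_count * BIG / total
per_disk_ratio = BIG / SMALL

print(f"{big_count} of {len(all_weights)} OSDs hold {big_share:.1%} of the total weight")
print(f"each 0.45 OSD expects about {per_disk_ratio:.1f}x the writes of a 0.06999 OSD")
```

So under that assumption a handful of the spindles would be asked to absorb more than half of the writes, which is why they end up being the ones everything else waits on.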
What I'd suggest doing is putting all of your 73GB disks in one pool and your 500GB disks in another pool. I suspect that if you do that and put your journals on the first partition of each disk, you'll see some improvement in your benchmark results.
And the result:

root@ceph-node03:/home/ceph# rados bench -p ceph-cloud 20 write -t 10
Maintaining 10 concurrent writes of 4194304 bytes for up to 20 seconds or 0 objects
Object prefix: benchmark_data_ceph-node03_29727
  sec  Cur ops  started  finished  avg MB/s  cur MB/s  last lat   avg lat
    0        0        0         0         0         0         -         0
    1       10       30        20   79.9465        80  0.159295  0.378849
    2       10       52        42   83.9604        88  0.719616  0.430293
    3       10       74        64   85.2991        88  0.487685  0.412956
    4       10       97        87   86.9676        92  0.351122  0.418814
    5       10      123       113   90.3679       104  0.317011  0.418876
    6       10      147       137   91.3012        96  0.562112  0.418178
    7       10      172       162   92.5398       100  0.691045  0.413416
    8       10      197       187    93.469       100  0.459424  0.415459
    9       10      222       212   94.1915       100  0.798889  0.416093
   10       10      248       238   95.1697       104  0.440002  0.415609
   11       10      267       257   93.4252        76   0.48959   0.41531
   12       10      289       279   92.9707        88  0.524622  0.420145
   13       10      313       303   93.2016        96   1.02104  0.423955
   14       10      336       326   93.1136        92  0.477328  0.420684
   15       10      359       349    93.037        92  0.591118  0.418589
   16       10      383       373   93.2204        96  0.600392  0.421916
   17       10      407       397   93.3812        96  0.240166  0.419829
   18       10      431       421    93.526        96  0.746706  0.420971
   19       10      457       447   94.0757       104  0.237565  0.419025
2013-12-27 13:13:21.817874min lat: 0.101352 max lat: 1.81426 avg lat: 0.418242
  sec  Cur ops  started  finished  avg MB/s  cur MB/s  last lat   avg lat
   20       10      480       470   93.9709        92  0.489254  0.418242
Total time run:         20.258064
Total writes made:      481
Write size:             4194304
Bandwidth (MB/sec):     94.975
Stddev Bandwidth:       21.7799
Max bandwidth (MB/sec): 104
Min bandwidth (MB/sec): 0
Average Latency:        0.420573
Stddev Latency:         0.226378
Max latency:            1.81426
Min latency:            0.101352
root@ceph-node03:/home/ceph#

Thanks in advance,

Best regards,

*German Anders*

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com