Re: Cluster Performance very Poor

On Fri, 27 Dec 2013, German Anders wrote:
> Hi Mark,
>             I've already made those changes but the performance is almost
> the same. I ran another test with a dd command and the results were the
> same (I've used all of the 73GB disks for the OSDs and also put the journal
> inside the OSD device). I also noticed that the network is at 1Gb:

Wait... this is a 1Gbps network?  And you're getting around 100 MB/sec 
from a single client?  That is about right given what the client NIC is 
capable of.
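
Back-of-the-envelope: 1Gbps is 125 MB/s of raw line rate, and after 
Ethernet/IP/TCP framing overhead a single TCP stream realistically tops out 
around 110-118 MB/s, so ~100 MB/s of buffered writes through XFS on krbd is 
the client NIC running at saturation rather than a Ceph limit.  If you want 
to confirm the link is the bottleneck, a plain TCP test from the client to 
one of the OSD nodes should hit the same ceiling, for example (just a 
sketch, assuming iperf is installed on both ends; substitute your own 
addresses):

    # on an OSD node, e.g. 10.1.1.151
    iperf -s

    # on the client (ceph-node04), run a 30 second TCP test
    iperf -c 10.1.1.151 -t 30

If that also tops out near ~940 Mbit/s, the cluster is already delivering 
everything one 1GbE client link can carry, and the next step is bonding, 
10GbE, or testing with several clients in parallel.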

sage


> 
> ceph@ceph-node04:~$ sudo rbd -m 10.1.1.151 -p ceph-cloud --size 102400
> create rbdCloud -k /etc/ceph/ceph.client.admin.keyring
> ceph@ceph-node04:~$ sudo rbd map -m 10.1.1.151 rbdCloud --pool ceph-cloud
> --id admin -k /etc/ceph/ceph.client.admin.keyring
> ceph@ceph-node04:~$ sudo mkdir /mnt/rbdCloud
> ceph@ceph-node04:~$ sudo mkfs.xfs -l size=64m,lazy-count=1 -f
> /dev/rbd/ceph-cloud/rbdCloud
> log stripe unit (4194304 bytes) is too large (maximum is 256KiB)
> log stripe unit adjusted to 32KiB
> meta-data=/dev/rbd/ceph-cloud/rbdCloud isize=256    agcount=17,
> agsize=1637376 blks
>          =                       sectsz=512   attr=2, projid32bit=0
> data     =                       bsize=4096   blocks=26214400, imaxpct=25
>          =                       sunit=1024   swidth=1024 blks
> naming   =version 2              bsize=4096   ascii-ci=0
> log      =internal log           bsize=4096   blocks=16384, version=2
>          =                       sectsz=512   sunit=8 blks, lazy-count=1
> realtime =none                   extsz=4096   blocks=0, rtextents=0
> ceph@ceph-node04:~$
> ceph@ceph-node04:~$ sudo mount /dev/rbd/ceph-cloud/rbdCloud /mnt/rbdCloud
> ceph@ceph-node04:~$ cd /mnt/rbdCloud
> ceph@ceph-node04:/mnt/rbdCloud$
> ceph@ceph-node04:/mnt/rbdCloud$ for i in 1 2 3 4; do sudo dd if=/dev/zero
> of=a bs=1M count=1000 conv=fdatasync; done
> 1000+0 records in
> 1000+0 records out
> 1048576000 bytes (1.0 GB) copied, 10.2545 s, 102 MB/s
> 1000+0 records in
> 1000+0 records out
> 1048576000 bytes (1.0 GB) copied, 10.0554 s, 104 MB/s
> 1000+0 records in
> 1000+0 records out
> 1048576000 bytes (1.0 GB) copied, 10.2352 s, 102 MB/s
> 1000+0 records in
> 1000+0 records out
> 1048576000 bytes (1.0 GB) copied, 10.1197 s, 104 MB/s
> ceph@ceph-node04:/mnt/rbdCloud$
> 
> OSD tree:
> 
> ceph@ceph-node05:~/ceph-cluster-prd$ sudo ceph osd tree
> # id    weight    type name    up/down    reweight
> -1    3.43    root default
> -2    0.6299        host ceph-node01
> 12    0.06999            osd.12    up    1   
> 13    0.06999            osd.13    up    1   
> 14    0.06999            osd.14    up    1   
> 15    0.06999            osd.15    up    1   
> 16    0.06999            osd.16    up    1   
> 17    0.06999            osd.17    up    1   
> 18    0.06999            osd.18    up    1   
> 19    0.06999            osd.19    up    1   
> 20    0.06999            osd.20    up    1   
> -3    0.6999        host ceph-node02
> 22    0.06999            osd.22    up    1   
> 23    0.06999            osd.23    up    1   
> 24    0.06999            osd.24    up    1   
> 25    0.06999            osd.25    up    1   
> 26    0.06999            osd.26    up    1   
> 27    0.06999            osd.27    up    1   
> 28    0.06999            osd.28    up    1   
> 29    0.06999            osd.29    up    1   
> 30    0.06999            osd.30    up    1   
> 31    0.06999            osd.31    up    1   
> -4    0.6999        host ceph-node03
> 32    0.06999            osd.32    up    1   
> 33    0.06999            osd.33    up    1   
> 34    0.06999            osd.34    up    1   
> 35    0.06999            osd.35    up    1   
> 36    0.06999            osd.36    up    1   
> 37    0.06999            osd.37    up    1   
> 38    0.06999            osd.38    up    1   
> 39    0.06999            osd.39    up    1   
> 40    0.06999            osd.40    up    1   
> 41    0.06999            osd.41    up    1   
> -5    0.6999        host ceph-node04
> 0    0.06999            osd.0    up    1   
> 1    0.06999            osd.1    up    1   
> 2    0.06999            osd.2    up    1   
> 3    0.06999            osd.3    up    1   
> 4    0.06999            osd.4    up    1   
> 5    0.06999            osd.5    up    1   
> 6    0.06999            osd.6    up    1   
> 7    0.06999            osd.7    up    1   
> 8    0.06999            osd.8    up    1   
> 9    0.06999            osd.9    up    1   
> -6    0.6999        host ceph-node05
> 10    0.06999            osd.10    up    1   
> 11    0.06999            osd.11    up    1   
> 42    0.06999            osd.42    up    1   
> 43    0.06999            osd.43    up    1   
> 44    0.06999            osd.44    up    1   
> 45    0.06999            osd.45    up    1   
> 46    0.06999            osd.46    up    1   
> 47    0.06999            osd.47    up    1   
> 48    0.06999            osd.48    up    1   
> 49    0.06999            osd.49    up    1
> 
> 
> Any ideas?
> 
> Thanks in advance,
>  
> 
> German Anders
> 
>       --- Original message ---
>       Subject: Re: Cluster Performance very Poor
>       From: Mark Nelson <mark.nelson@xxxxxxxxxxx>
>       To: <ceph-users@xxxxxxxxxxxxxx>
>       Date: Friday, 27/12/2013 15:39
> 
>       On 12/27/2013 12:19 PM, German Anders wrote:
>                 Hi Cephers,
> 
>                      I've run a rados bench to measure the
>             throughput of the cluster,
>             and found that the performance is really poor:
> 
>             The setup is the following:
> 
>             OS: Ubuntu 12.10 Server 64 bits
> 
> 
>             ceph-node01 (mon)     10.77.0.101  ProLiant BL460c G7, 32GB, 8 x 2 GHz
>                                   10.1.1.151   D2200sb Storage Blade (Firmware: 2.30)
>             ceph-node02 (mon)     10.77.0.102  ProLiant BL460c G7, 64GB, 8 x 2 GHz
>                                   10.1.1.152   D2200sb Storage Blade (Firmware: 2.30)
>             ceph-node03 (mon)     10.77.0.103  ProLiant BL460c G6, 32GB, 8 x 2 GHz
>                                   10.1.1.153   D2200sb Storage Blade (Firmware: 2.30)
>             ceph-node04           10.77.0.104  ProLiant BL460c G7, 32GB, 8 x 2 GHz
>                                   10.1.1.154   D2200sb Storage Blade (Firmware: 2.30)
>             ceph-node05 (deploy)  10.77.0.105  ProLiant BL460c G6, 32GB, 8 x 2 GHz
>                                   10.1.1.155   D2200sb Storage Blade (Firmware: 2.30)
> 
> 
>       If your servers have controllers with writeback cache, please
>       make sure
>       it is enabled as that will likely help.
> 
> 
>             ceph-node01:
> 
>                    /dev/sda 73G (OSD)
>                    /dev/sdb 73G (OSD)
>                    /dev/sdc 73G (OSD)
>                    /dev/sdd 73G (OSD)
>                    /dev/sde 73G (OSD)
>                    /dev/sdf 73G (OSD)
>                    /dev/sdg 73G (OSD)
>                    /dev/sdh 73G (OSD)
>                    /dev/sdi 73G (OSD)
>                    /dev/sdj 73G (Journal)
>                    /dev/sdk 500G (OSD)
>                    /dev/sdl 500G (OSD)
>                    /dev/sdn 146G (Journal)
> 
>             ceph-node02:
> 
>                    /dev/sda 73G (OSD)
>                    /dev/sdb 73G (OSD)
>                    /dev/sdc 73G (OSD)
>                    /dev/sdd 73G (OSD)
>                    /dev/sde 73G (OSD)
>                    /dev/sdf 73G (OSD)
>                    /dev/sdg 73G (OSD)
>                    /dev/sdh 73G (OSD)
>                    /dev/sdi 73G (OSD)
>                    /dev/sdj 73G (Journal)
>                    /dev/sdk 500G (OSD)
>                    /dev/sdl 500G (OSD)
>                    /dev/sdn 146G (Journal)
> 
>             ceph-node03:
> 
>                    /dev/sda 73G (OSD)
>                    /dev/sdb 73G (OSD)
>                    /dev/sdc 73G (OSD)
>                    /dev/sdd 73G (OSD)
>                    /dev/sde 73G (OSD)
>                    /dev/sdf 73G (OSD)
>                    /dev/sdg 73G (OSD)
>                    /dev/sdh 73G (OSD)
>                    /dev/sdi 73G (OSD)
>                    /dev/sdj 73G (Journal)
>                    /dev/sdk 500G (OSD)
>                    /dev/sdl 500G (OSD)
>                    /dev/sdn 73G (Journal)
> 
>             ceph-node04:
> 
>                    /dev/sda 73G (OSD)
>                    /dev/sdb 73G (OSD)
>                    /dev/sdc 73G (OSD)
>                    /dev/sdd 73G (OSD)
>                    /dev/sde 73G (OSD)
>                    /dev/sdf 73G (OSD)
>                    /dev/sdg 73G (OSD)
>                    /dev/sdh 73G (OSD)
>                    /dev/sdi 73G (OSD)
>                    /dev/sdj 73G (Journal)
>                    /dev/sdk 500G (OSD)
>                    /dev/sdl 500G (OSD)
>                    /dev/sdn 146G (Journal)
> 
>             ceph-node05:
> 
>                    /dev/sda 73G (OSD)
>                    /dev/sdb 73G (OSD)
>                    /dev/sdc 73G (OSD)
>                    /dev/sdd 73G (OSD)
>                    /dev/sde 73G (OSD)
>                    /dev/sdf 73G (OSD)
>                    /dev/sdg 73G (OSD)
>                    /dev/sdh 73G (OSD)
>                    /dev/sdi 73G (OSD)
>                    /dev/sdj 73G (Journal)
>                    /dev/sdk 500G (OSD)
>                    /dev/sdl 500G (OSD)
>                    /dev/sdn 73G (Journal)
> 
> 
>       Am I correct in assuming that you've put all of your journals for
>       every disk in each node on two spinning disks?  This is going to be
>       quite slow, because Ceph does a full write of the data to the
>       journal for every real write.  The general solution is to either
>       use SSDs for the journals (preferably multiple fast SSDs with high
>       write endurance and only 3-6 OSD journals each), or to put the
>       journals on a partition on the data disk.
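> 
>       With ceph-deploy the colocated-journal layout should be just a
>       matter of handing it the whole device and letting ceph-disk carve
>       out a journal partition next to the data partition.  A rough
>       sketch (untested here, and the device names are only examples):
> 
>       # wipe the disk, then create an OSD with its journal on a second
>       # partition of the same drive (no explicit journal device given)
>       ceph-deploy disk zap ceph-node01:sdk
>       ceph-deploy osd create ceph-node01:sdk
> 
>       # or, if an SSD is available, point the journal at it instead:
>       # ceph-deploy osd create ceph-node01:sdk:/dev/<ssd-device>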
> 
> 
>             And the OSD tree is:
> 
>             root@ceph-node03:/home/ceph# ceph osd tree
>             # id weight type name up/down reweight
>             -1 7.27 root default
>             -2 1.15 host ceph-node01
>             12 0.06999 osd.12 up 1
>             13 0.06999 osd.13 up 1
>             14 0.06999 osd.14 up 1
>             15 0.06999 osd.15 up 1
>             16 0.06999 osd.16 up 1
>             17 0.06999 osd.17 up 1
>             18 0.06999 osd.18 up 1
>             19 0.06999 osd.19 up 1
>             20 0.06999 osd.20 up 1
>             21 0.45 osd.21 up 1
>             22 0.06999 osd.22 up 1
>             -3 1.53 host ceph-node02
>             23 0.06999 osd.23 up 1
>             24 0.06999 osd.24 up 1
>             25 0.06999 osd.25 up 1
>             26 0.06999 osd.26 up 1
>             27 0.06999 osd.27 up 1
>             28 0.06999 osd.28 up 1
>             29 0.06999 osd.29 up 1
>             30 0.06999 osd.30 up 1
>             31 0.06999 osd.31 up 1
>             32 0.45 osd.32 up 1
>             33 0.45 osd.33 up 1
>             -4 1.53 host ceph-node03
>             34 0.06999 osd.34 up 1
>             35 0.06999 osd.35 up 1
>             36 0.06999 osd.36 up 1
>             37 0.06999 osd.37 up 1
>             38 0.06999 osd.38 up 1
>             39 0.06999 osd.39 up 1
>             40 0.06999 osd.40 up 1
>             41 0.06999 osd.41 up 1
>             42 0.06999 osd.42 up 1
>             43 0.45 osd.43 up 1
>             44 0.45 osd.44 up 1
>             -5 1.53 host ceph-node04
>             0 0.06999 osd.0 up 1
>             1 0.06999 osd.1 up 1
>             2 0.06999 osd.2 up 1
>             3 0.06999 osd.3 up 1
>             4 0.06999 osd.4 up 1
>             5 0.06999 osd.5 up 1
>             6 0.06999 osd.6 up 1
>             7 0.06999 osd.7 up 1
>             8 0.06999 osd.8 up 1
>             9 0.45 osd.9 up 1
>             10 0.45 osd.10 up 1
>             -6 1.53 host ceph-node05
>             11 0.06999 osd.11 up 1
>             45 0.06999 osd.45 up 1
>             46 0.06999 osd.46 up 1
>             47 0.06999 osd.47 up 1
>             48 0.06999 osd.48 up 1
>             49 0.06999 osd.49 up 1
>             50 0.06999 osd.50 up 1
>             51 0.06999 osd.51 up 1
>             52 0.06999 osd.52 up 1
>             53 0.45 osd.53 up 1
>             54 0.45 osd.54 up 1
> 
> 
>       Based on this, it appears your 500GB drives are weighted much
>       higher than the 73GB drives.  This will help even out the data
>       distribution, but unfortunately it will make the system slower if
>       all of the OSDs are in the same pool.  What this does is give the
>       500GB drives a higher proportion of the writes than the other
>       drives, but those drives are almost certainly no faster than the
>       other ones.  Because there is a limited number of outstanding IOs
>       you can have (due to memory constraints), eventually all
>       outstanding IOs will be waiting on the 500GB disks while the 73GB
>       disks mostly sit around waiting for work.
> 
>       What I'd suggest doing is putting all of your 73GB disks in one
>       pool and your 500GB disks in another pool.  I suspect that if you
>       do that and put your journals on the first partition of each disk,
>       you'll see some improvement in your benchmark results.
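> 
>       One way to do that is to give the 500GB OSDs their own CRUSH root
>       and rule, and point a new pool at that rule.  A sketch only (the
>       bucket, rule and pool names are made up, the ruleset id needs to
>       be checked with "ceph osd crush rule dump", and the PG count needs
>       sizing for your cluster):
> 
>       # separate root for the big drives, with a per-host bucket,
>       # then move the 500GB OSDs into it (repeat per host/OSD)
>       ceph osd crush add-bucket slow root
>       ceph osd crush add-bucket ceph-node01-slow host
>       ceph osd crush move ceph-node01-slow root=slow
>       ceph osd crush set osd.21 0.45 root=slow host=ceph-node01-slow
> 
>       # rule that only chooses from the new root, and a pool that uses it
>       ceph osd crush rule create-simple slow-rule slow host
>       ceph osd pool create slow-pool 256 256
>       ceph osd pool set slow-pool crush_ruleset 1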
> 
> 
> 
>             And the result:
> 
>             root@ceph-node03:/home/ceph# rados bench -p ceph-cloud 20 write -t 10
>              Maintaining 10 concurrent writes of 4194304 bytes for up to 20 seconds or 0 objects
>              Object prefix: benchmark_data_ceph-node03_29727
>                sec Cur ops   started  finished  avg MB/s  cur MB/s  last lat   avg lat
>                  0       0         0         0         0         0         -         0
>                  1      10        30        20   79.9465        80  0.159295  0.378849
>                  2      10        52        42   83.9604        88  0.719616  0.430293
>                  3      10        74        64   85.2991        88  0.487685  0.412956
>                  4      10        97        87   86.9676        92  0.351122  0.418814
>                  5      10       123       113   90.3679       104  0.317011  0.418876
>                  6      10       147       137   91.3012        96  0.562112  0.418178
>                  7      10       172       162   92.5398       100  0.691045  0.413416
>                  8      10       197       187    93.469       100  0.459424  0.415459
>                  9      10       222       212   94.1915       100  0.798889  0.416093
>                 10      10       248       238   95.1697       104  0.440002  0.415609
>                 11      10       267       257   93.4252        76   0.48959   0.41531
>                 12      10       289       279   92.9707        88  0.524622  0.420145
>                 13      10       313       303   93.2016        96   1.02104  0.423955
>                 14      10       336       326   93.1136        92  0.477328  0.420684
>                 15      10       359       349    93.037        92  0.591118  0.418589
>                 16      10       383       373   93.2204        96  0.600392  0.421916
>                 17      10       407       397   93.3812        96  0.240166  0.419829
>                 18      10       431       421    93.526        96  0.746706  0.420971
>                 19      10       457       447   94.0757       104  0.237565  0.419025
>             2013-12-27 13:13:21.817874 min lat: 0.101352 max lat: 1.81426 avg lat: 0.418242
>                sec Cur ops   started  finished  avg MB/s  cur MB/s  last lat   avg lat
>                 20      10       480       470   93.9709        92  0.489254  0.418242
>                 Total time run: 20.258064
>             Total writes made: 481
>             Write size: 4194304
>             Bandwidth (MB/sec): 94.975
> 
>             Stddev Bandwidth: 21.7799
>             Max bandwidth (MB/sec): 104
>             Min bandwidth (MB/sec): 0
>             Average Latency: 0.420573
>             Stddev Latency: 0.226378
>             Max latency: 1.81426
>             Min latency: 0.101352
>             root@ceph-node03:/home/ceph#
> 
>             Thanks in advance,
> 
>             Best regards,
> 
>             *German Anders*
> 
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
