Re: Low speed of write to cephfs

John Spray <jspray@xxxxxxxxxx> · Thu, 15 Oct 2015 19:49:01 +0100

On Thu, Oct 15, 2015 at 5:11 PM, Butkeev Stas <staerist@xxxxx> wrote:
> Hello all,
> Has anybody tried using cephfs?
>
> I have two servers running RHEL 7.1 (latest kernel, 3.10.0-229.14.1.el7.x86_64). Each server has 15G of flash for the ceph journals and 12 x 2 TB SATA disks for data.
> The nodes are connected by a 56 Gb/s InfiniBand (IPoIB) interconnect.
>
>
> Cluster version
> # ceph -v
> ceph version 0.94.3 (95cefea9fd9ab740263bf8bb4796fd864d9afe2b)
>
> Cluster config
> # cat /etc/ceph/ceph.conf
> [global]
>         auth service required = cephx
>         auth client required = cephx
>         auth cluster required = cephx
>         fsid = 0f05deaf-ee6f-4342-b589-5ecf5527aa6f
>         mon osd full ratio = .95
>         mon osd nearfull ratio = .90
>         osd pool default size = 2
>         osd pool default min size = 1
>         osd pool default pg num = 32
>         osd pool default pgp num = 32
>         max open files = 131072
>         osd crush chooseleaf type = 1
> [mds]
>
> [mds.a]
>         host = ak34
>
> [mon]
>         mon_initial_members = a,b
>
> [mon.a]
>         host = ak34
>         mon addr  = 172.24.32.134:6789
>
> [mon.b]
>         host = ak35
>         mon addr  = 172.24.32.135:6789
>
> [osd]
>         osd journal size = 1000
>
> [osd.0]
>         osd uuid = b3b3cd37-8df5-4455-8104-006ddba2c443
>         host = ak34
>         public addr  = 172.24.32.134
>         osd journal = /CEPH_JOURNAL/osd/ceph-0/journal
> .....
>
>
> Cluster tree
> # ceph osd tree
> ID WEIGHT   TYPE NAME                       UP/DOWN REWEIGHT PRIMARY-AFFINITY
> -1 45.75037 root default
> -2 45.75037     region RU
> -3 45.75037         datacenter ru-msk-ak48t
> -4 22.87518             host ak34
>  0  1.90627                 osd.0                up  1.00000          1.00000
>  1  1.90627                 osd.1                up  1.00000          1.00000
>  2  1.90627                 osd.2                up  1.00000          1.00000
>  3  1.90627                 osd.3                up  1.00000          1.00000
>  4  1.90627                 osd.4                up  1.00000          1.00000
>  5  1.90627                 osd.5                up  1.00000          1.00000
>  6  1.90627                 osd.6                up  1.00000          1.00000
>  7  1.90627                 osd.7                up  1.00000          1.00000
>  8  1.90627                 osd.8                up  1.00000          1.00000
>  9  1.90627                 osd.9                up  1.00000          1.00000
> 10  1.90627                 osd.10               up  1.00000          1.00000
> 11  1.90627                 osd.11               up  1.00000          1.00000
> -5 22.87518             host ak35
> 12  1.90627                 osd.12               up  1.00000          1.00000
> 13  1.90627                 osd.13               up  1.00000          1.00000
> 14  1.90627                 osd.14               up  1.00000          1.00000
> 15  1.90627                 osd.15               up  1.00000          1.00000
> 16  1.90627                 osd.16               up  1.00000          1.00000
> 17  1.90627                 osd.17               up  1.00000          1.00000
> 18  1.90627                 osd.18               up  1.00000          1.00000
> 19  1.90627                 osd.19               up  1.00000          1.00000
> 20  1.90627                 osd.20               up  1.00000          1.00000
> 21  1.90627                 osd.21               up  1.00000          1.00000
> 22  1.90627                 osd.22               up  1.00000          1.00000
> 23  1.90627                 osd.23               up  1.00000          1.00000
>
> Status of cluster
> # ceph -s
>     cluster 0f05deaf-ee6f-4342-b589-5ecf5527aa6f
>      health HEALTH_OK
>      monmap e1: 2 mons at {a=172.24.32.134:6789/0,b=172.24.32.135:6789/0}
>             election epoch 10, quorum 0,1 a,b
>      mdsmap e14: 1/1/1 up {0=a=up:active}
>      osdmap e194: 24 osds: 24 up, 24 in
>       pgmap v2305: 384 pgs, 3 pools, 271 GB data, 72288 objects
>             545 GB used, 44132 GB / 44678 GB avail
>                  384 active+clean
>
>
> Pools for cephfs
> # ceph osd dump|grep pg
> pool 1 'cephfs_data' replicated size 2 min_size 1 crush_ruleset 0 object_hash rjenkins pg_num 256 pgp_num 256 last_change 154 flags hashpspool crash_replay_interval 45 stripe_width 0
> pool 2 'cephfs_metadata' replicated size 2 min_size 1 crush_ruleset 0 object_hash rjenkins pg_num 64 pgp_num 64 last_change 144 flags hashpspool stripe_width 0
>
> Rados bench
> # rados bench -p cephfs_data 300 write --no-cleanup && rados bench -p cephfs_data 300 seq
>  Maintaining 16 concurrent writes of 4194304 bytes for up to 300 seconds or 0 objects
>  Object prefix: benchmark_data_XXXXXXXXXXXXXXXXXXXX_8108
>    sec Cur ops   started  finished  avg MB/s  cur MB/s  last lat   avg lat
>      0       0         0         0         0         0         -         0
>      1      16       170       154    615.74       616  0.109984 0.0978277
>      2      16       335       319   637.817       660 0.0623079 0.0985001
>      3      16       496       480   639.852       644 0.0992808 0.0982317
>      4      16       662       646   645.862       664 0.0683485 0.0980203
>      5      16       831       815   651.796       676 0.0773545 0.0973635
>      6      15       994       979   652.479       656  0.112323  0.096901
>      7      16      1164      1148   655.826       676  0.107592 0.0969845
>      8      16      1327      1311   655.335       652 0.0960067 0.0968445
>      9      16      1488      1472   654.066       644 0.0780589 0.0970879
>
> .....
>    297      16     43445     43429   584.811       596 0.0569516  0.109399
>    298      16     43601     43585   584.942       624 0.0707439  0.109388
>    299      16     43756     43740   585.059       620   0.20408  0.109363
> 2015-10-15 14:16:59.622610 min lat: 0.0109677 max lat: 0.951389 avg lat: 0.109344
>    sec Cur ops   started  finished  avg MB/s  cur MB/s  last lat   avg lat
>    300      13     43901     43888   585.082       592 0.0768806  0.109344
>  Total time run:        300.329089
> Total reads made:     43901
> Read size:            4194304
> Bandwidth (MB/sec):    584.705
>
> Average Latency:       0.109407
> Max latency:           0.951389
> Min latency:           0.0109677
>
> But the real write speed is very low:
>
> # dd if=/dev/zero|pv|dd oflag=direct of=44444 bs=4k count=10k
> 10240+0 records in
> 10240+0 records out
> 41943040 bytes (42 MB) copied, 25.9155 s, 1.6 MB/s
> 40.1MiB 0:00:25 [1.55MiB/s]
>
> # dd if=/dev/zero|pv|dd oflag=direct of=44444 bs=32k count=10k
> 10240+0 records in
> 10240+0 records out
> 335544320 bytes (336 MB) copied, 28.2998 s, 11.9 MB/s
>  320MiB 0:00:28 [11.3MiB/s]

So what happens if you continue increasing the 'bs' parameter?  Is
bs=1M nice and fast?
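
For reference, here is a rough sketch of why the block size matters here. Dividing the elapsed time of each dd run quoted above by its number of writes gives the average time per write. Both runs come out at roughly 2.5 to 2.8 ms per write regardless of size, which suggests (from only two data points, so treat it as a hint) that the direct writes are limited by per-request latency rather than by bandwidth. The helper names per_op_ms and sweep_bs and the path /mnt/cephfs/ddtest are made up for illustration; sweep_bs assumes GNU dd and writes about 64 MiB per run.

```shell
#!/bin/sh
# per_op_ms BYTES SECONDS BS: average milliseconds per write of size BS.
per_op_ms() {
    awk -v b="$1" -v s="$2" -v bs="$3" \
        'BEGIN { printf "%.3f\n", s / (b / bs) * 1000 }'
}

# Figures taken from the two dd runs quoted above.
echo "bs=4k:  $(per_op_ms 41943040 25.9155 4096) ms per write"
echo "bs=32k: $(per_op_ms 335544320 28.2998 32768) ms per write"

# sweep_bs FILE: repeat the direct-I/O write at increasing block sizes,
# printing dd's summary line for each. FILE is a scratch file on the
# cephfs mount (hypothetical path in the example below); it is removed
# at the end.
sweep_bs() {
    for spec in 4k:16384 32k:2048 256k:256 1M:64 4M:16; do
        bs=${spec%%:*}
        count=${spec##*:}
        echo "bs=$bs"
        dd if=/dev/zero of="$1" bs="$bs" count="$count" oflag=direct 2>&1 | tail -n 1
    done
    rm -f "$1"
}

# Not run by default. Example:
#   sweep_bs /mnt/cephfs/ddtest
```

If the time per write stays roughly constant as bs grows, throughput should scale with the block size. Note that the rados bench run above kept 16 writes of 4 MB in flight, while dd with oflag=direct issues one write at a time, so the two numbers are not directly comparable.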

John

>
> Do you know the root cause of the low write speed to the FS?
>
> Thank you in advance for your help!
>
> --
> Best Regards,
> Stanislav Butkeev
> _______________________________________________
> ceph-users mailing list
> ceph-users@xxxxxxxxxxxxxx
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com