Re: Low speed of write to cephfs

Did you try to fio test the journal device (direct=1)? That would give
you an idea of where the system is bounded. For a journal to be
effective it must operate faster than the backing SATA stores; in your
case the journal device must keep up with the needs of all 12 SATA drives.
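
A minimal sketch of such a test (the scratch-file path here is just an
example; point it at a file on the flash device, not at a live journal
file):

# fio --name=journal-test --filename=/CEPH_JOURNAL/fio-test.bin --size=1G \
      --ioengine=libaio --direct=1 --rw=randwrite --bs=4k --iodepth=32 \
      --runtime=60 --time_based --group_reporting

A second run with --rw=write --bs=1M would show the sequential ceiling
of the same device.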

Approximate theoretical limits for 12 SATA drives are about:
12 * 100 IOPS/disk = 1200 IOPS
12 * 100 MB/s = 1.2 GB/s (and that's way too generous for random SATA access)

In any case, you can run atop on an OSD machine and see where the
bottleneck is; that will also give you real per-disk IOPS numbers.
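
For example, something like this (assuming atop is installed; its DSK
lines report per-disk busy percentage, read/write counts and
throughput, here refreshed every 5 seconds):

# atop -d 5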

On Thu, Oct 15, 2015 at 1:35 PM, Butkeev Stas <staerist@xxxxx> wrote:
> Hello Max,
>
> It is a 15 GB SCSI disk exported from a flash array to the server.
> # multipath -ll
> XXXXXXXXXXXXXXXXXXXXXXXXX dm-3 XXXXXXXXXX
> size=15G features='0' hwhandler='0' wp=rw
> `-+- policy='round-robin 0' prio=1 status=active
>   |- 5:0:0:2 sdp 8:240 active ready running
>   |- 4:0:0:2 sdq 65:0  active ready running
>   |- 6:0:0:2 sds 65:32 active ready running
>   `- 7:0:0:2 sdu 65:64 active ready running
>
> In the config you can see the option "osd journal size = 1000". I use 12 GB on each node for the Ceph journals.
>
> For example
>
> # ls -l /CEPH_JOURNAL/*/*
> /CEPH_JOURNAL/osd/ceph-0:
> total 1024000
> -rw-r--r-- 1 root root 1048576000 Oct 15 19:03 journal
>
> /CEPH_JOURNAL/osd/ceph-1:
> total 1024000
> -rw-r--r-- 1 root root 1048576000 Oct 15 19:03 journal
>
> /CEPH_JOURNAL/osd/ceph-10:
> total 1024000
> -rw-r--r-- 1 root root 1048576000 Oct 15 19:04 journal
>
> /CEPH_JOURNAL/osd/ceph-11:
> total 1024000
> -rw-r--r-- 1 root root 1048576000 Oct 15 19:03 journal
>
> /CEPH_JOURNAL/osd/ceph-2:
> total 1024000
> -rw-r--r-- 1 root root 1048576000 Oct 15 19:03 journal
>
> /CEPH_JOURNAL/osd/ceph-3:
> total 1024000
> -rw-r--r-- 1 root root 1048576000 Oct 15 19:03 journal
> .......
> --
> Best Regards,
> Stanislav Butkeev
>
>
> 15.10.2015, 23:26, "Max Yehorov" <myehorov@xxxxxxxxxx>:
>> Stas,
>>
>> as you said: "Each server has 15G flash for ceph journal and 12*2Tb
>> SATA disk for"
>>
>> What is this 15G flash and is it used for all 12 SATA drives?
>>
>> On Thu, Oct 15, 2015 at 1:05 PM, John Spray <jspray@xxxxxxxxxx> wrote:
>>>  On Thu, Oct 15, 2015 at 8:46 PM, Butkeev Stas <staerist@xxxxx> wrote:
>>>>  Thank you for your comment. I know what the oflag=direct option means, and the other aspects of stress testing.
>>>>  Unfortunately, the write speed on this cluster FS is very slow.
>>>>
>>>>  The same test on another cluster FS (GPFS), which consists of 4 disks:
>>>>
>>>>  # dd if=/dev/zero|pv|dd oflag=direct of=99999 bs=4k count=10k
>>>>  40.1MB 0:00:05 [7.57MB/s] [ <=> ]
>>>>  10240+0 records in
>>>>  10240+0 records out
>>>>  41943040 bytes (42 MB) copied, 5.27816 s, 7.9 MB/s
>>>>
>>>>  I hope I am simply missing some configuration options, or something along those lines.
>>>
>>>  I don't know much about GPFS internals, since it's proprietary, but a
>>>  quick google brings us here:
>>>  http://www-01.ibm.com/support/knowledgecenter/SSFKCN_4.1.0.4/com.ibm.cluster.gpfs.v4r104.gpfs100.doc/bl1adm_considerations_direct_io.htm
>>>
>>>  It appears that GPFS only respects O_DIRECT in certain circumstances,
>>>  and in some circumstances will use their "pagepool" cache even when
>>>  direct IO is requested. You would probably need to check with IBM to
>>>  work out exactly whether true direct IO is happening when you run on
>>>  GPFS.
>>>
>>>  John
>>>
>>>>  --
>>>>  Best Regards,
>>>>  Stanislav Butkeev
>>>>
>>>>  15.10.2015, 22:36, "John Spray" <jspray@xxxxxxxxxx>:
>>>>>  On Thu, Oct 15, 2015 at 8:17 PM, Butkeev Stas <staerist@xxxxx> wrote:
>>>>>>   Hello John
>>>>>>
>>>>>>   Yes, of course the write speed rises, because we are increasing the amount of data per disk operation.
>>>>>>   But do you know of even one piece of software that writes data in 1 MB blocks? I don't, and neither do you.
>>>>>
>>>>>  Plenty of applications do large writes, especially if they're intended
>>>>>  for use on network filesystems.
>>>>>
>>>>>  When you pass oflag=direct, you are asking the kernel to send these
>>>>>  writes individually instead of aggregating them in the page cache.
>>>>>  What you're measuring here is effectively the issue rate of small
>>>>>  messages, rather than the speed at which data can be written to ceph.
>>>>>
>>>>>  Try the same benchmark with NFS; you'll get similar scaling with block size.
>>>>>
>>>>>  Cheers,
>>>>>  John
>>>>>
>>>>>  If you want to aggregate these writes in the page cache before sending
>>>>>  them over the network, I imagine you probably need to disable direct
>>>>>  IO.
>>>>>
>>>>>>   A simple test: dd to an ordinary 2 TB SATA disk
>>>>>>
>>>>>>   # dd if=/dev/zero|pv|dd oflag=direct of=/dev/sdi bs=4k count=1M
>>>>>>      4GiB 0:00:46 [87.2MiB/s] [ <=> ]
>>>>>>   1048576+0 records in
>>>>>>   1048576+0 records out
>>>>>>   4294967296 bytes (4.3 GB) copied, 46.9688 s, 91.4 MB/s
>>>>>>
>>>>>>   # dd if=/dev/zero|pv|dd oflag=direct of=/dev/sdi bs=32k count=10k
>>>>>>   dd: warning: partial read (24576 bytes); suggest iflag=fullblock
>>>>>>    319MiB 0:00:03 [ 103MiB/s] [ <=> ]
>>>>>>   10219+21 records in
>>>>>>   10219+21 records out
>>>>>>   335262720 bytes (335 MB) copied, 3.15001 s, 106 MB/s
>>>>>>
>>>>>>   A single SATA disk gets a better rate than CephFS, which consists of 24 of these same disks.
>>>>>>
>>>>>>   --
>>>>>>   Best Regards,
>>>>>>   Stanislav Butkeev
>>>>>>
>>>>>>   15.10.2015, 21:49, "John Spray" <jspray@xxxxxxxxxx>:
>>>>>>>   On Thu, Oct 15, 2015 at 5:11 PM, Butkeev Stas <staerist@xxxxx> wrote:
>>>>>>>>    Hello all,
>>>>>>>>    Has anybody tried using CephFS?
>>>>>>>>
>>>>>>>>    I have two servers with RHEL7.1(latest kernel 3.10.0-229.14.1.el7.x86_64). Each server has 15G flash for ceph journal and 12*2Tb SATA disk for data.
>>>>>>>>    I have an InfiniBand (IPoIB) 56 Gb/s interconnect between the nodes.
>>>>>>>>
>>>>>>>>    Cluster version
>>>>>>>>    # ceph -v
>>>>>>>>    ceph version 0.94.3 (95cefea9fd9ab740263bf8bb4796fd864d9afe2b)
>>>>>>>>
>>>>>>>>    Cluster config
>>>>>>>>    # cat /etc/ceph/ceph.conf
>>>>>>>>    [global]
>>>>>>>>            auth service required = cephx
>>>>>>>>            auth client required = cephx
>>>>>>>>            auth cluster required = cephx
>>>>>>>>            fsid = 0f05deaf-ee6f-4342-b589-5ecf5527aa6f
>>>>>>>>            mon osd full ratio = .95
>>>>>>>>            mon osd nearfull ratio = .90
>>>>>>>>            osd pool default size = 2
>>>>>>>>            osd pool default min size = 1
>>>>>>>>            osd pool default pg num = 32
>>>>>>>>            osd pool default pgp num = 32
>>>>>>>>            max open files = 131072
>>>>>>>>            osd crush chooseleaf type = 1
>>>>>>>>    [mds]
>>>>>>>>
>>>>>>>>    [mds.a]
>>>>>>>>            host = ak34
>>>>>>>>
>>>>>>>>    [mon]
>>>>>>>>            mon_initial_members = a,b
>>>>>>>>
>>>>>>>>    [mon.a]
>>>>>>>>            host = ak34
>>>>>>>>            mon addr = 172.24.32.134:6789
>>>>>>>>
>>>>>>>>    [mon.b]
>>>>>>>>            host = ak35
>>>>>>>>            mon addr = 172.24.32.135:6789
>>>>>>>>
>>>>>>>>    [osd]
>>>>>>>>            osd journal size = 1000
>>>>>>>>
>>>>>>>>    [osd.0]
>>>>>>>>            osd uuid = b3b3cd37-8df5-4455-8104-006ddba2c443
>>>>>>>>            host = ak34
>>>>>>>>            public addr = 172.24.32.134
>>>>>>>>            osd journal = /CEPH_JOURNAL/osd/ceph-0/journal
>>>>>>>>    .....
>>>>>>>>
>>>>>>>>    Below is the cluster tree:
>>>>>>>>    # ceph osd tree
>>>>>>>>    ID WEIGHT TYPE NAME UP/DOWN REWEIGHT PRIMARY-AFFINITY
>>>>>>>>    -1 45.75037 root default
>>>>>>>>    -2 45.75037 region RU
>>>>>>>>    -3 45.75037 datacenter ru-msk-ak48t
>>>>>>>>    -4 22.87518 host ak34
>>>>>>>>     0 1.90627 osd.0 up 1.00000 1.00000
>>>>>>>>     1 1.90627 osd.1 up 1.00000 1.00000
>>>>>>>>     2 1.90627 osd.2 up 1.00000 1.00000
>>>>>>>>     3 1.90627 osd.3 up 1.00000 1.00000
>>>>>>>>     4 1.90627 osd.4 up 1.00000 1.00000
>>>>>>>>     5 1.90627 osd.5 up 1.00000 1.00000
>>>>>>>>     6 1.90627 osd.6 up 1.00000 1.00000
>>>>>>>>     7 1.90627 osd.7 up 1.00000 1.00000
>>>>>>>>     8 1.90627 osd.8 up 1.00000 1.00000
>>>>>>>>     9 1.90627 osd.9 up 1.00000 1.00000
>>>>>>>>    10 1.90627 osd.10 up 1.00000 1.00000
>>>>>>>>    11 1.90627 osd.11 up 1.00000 1.00000
>>>>>>>>    -5 22.87518 host ak35
>>>>>>>>    12 1.90627 osd.12 up 1.00000 1.00000
>>>>>>>>    13 1.90627 osd.13 up 1.00000 1.00000
>>>>>>>>    14 1.90627 osd.14 up 1.00000 1.00000
>>>>>>>>    15 1.90627 osd.15 up 1.00000 1.00000
>>>>>>>>    16 1.90627 osd.16 up 1.00000 1.00000
>>>>>>>>    17 1.90627 osd.17 up 1.00000 1.00000
>>>>>>>>    18 1.90627 osd.18 up 1.00000 1.00000
>>>>>>>>    19 1.90627 osd.19 up 1.00000 1.00000
>>>>>>>>    20 1.90627 osd.20 up 1.00000 1.00000
>>>>>>>>    21 1.90627 osd.21 up 1.00000 1.00000
>>>>>>>>    22 1.90627 osd.22 up 1.00000 1.00000
>>>>>>>>    23 1.90627 osd.23 up 1.00000 1.00000
>>>>>>>>
>>>>>>>>    Cluster status:
>>>>>>>>    # ceph -s
>>>>>>>>        cluster 0f05deaf-ee6f-4342-b589-5ecf5527aa6f
>>>>>>>>         health HEALTH_OK
>>>>>>>>         monmap e1: 2 mons at {a=172.24.32.134:6789/0,b=172.24.32.135:6789/0}
>>>>>>>>                election epoch 10, quorum 0,1 a,b
>>>>>>>>         mdsmap e14: 1/1/1 up {0=a=up:active}
>>>>>>>>         osdmap e194: 24 osds: 24 up, 24 in
>>>>>>>>          pgmap v2305: 384 pgs, 3 pools, 271 GB data, 72288 objects
>>>>>>>>                545 GB used, 44132 GB / 44678 GB avail
>>>>>>>>                     384 active+clean
>>>>>>>>
>>>>>>>>    Pools for CephFS:
>>>>>>>>    # ceph osd dump|grep pg
>>>>>>>>    pool 1 'cephfs_data' replicated size 2 min_size 1 crush_ruleset 0 object_hash rjenkins pg_num 256 pgp_num 256 last_change 154 flags hashpspool crash_replay_interval 45 stripe_width 0
>>>>>>>>    pool 2 'cephfs_metadata' replicated size 2 min_size 1 crush_ruleset 0 object_hash rjenkins pg_num 64 pgp_num 64 last_change 144 flags hashpspool stripe_width 0
>>>>>>>>
>>>>>>>>    Rados bench
>>>>>>>>    # rados bench -p cephfs_data 300 write --no-cleanup && rados bench -p cephfs_data 300 seq
>>>>>>>>     Maintaining 16 concurrent writes of 4194304 bytes for up to 300 seconds or 0 objects
>>>>>>>>     Object prefix: benchmark_data_XXXXXXXXXXXXXXXXXXXX_8108
>>>>>>>>       sec Cur ops started finished avg MB/s cur MB/s last lat avg lat
>>>>>>>>         0 0 0 0 0 0 - 0
>>>>>>>>         1 16 170 154 615.74 616 0.109984 0.0978277
>>>>>>>>         2 16 335 319 637.817 660 0.0623079 0.0985001
>>>>>>>>         3 16 496 480 639.852 644 0.0992808 0.0982317
>>>>>>>>         4 16 662 646 645.862 664 0.0683485 0.0980203
>>>>>>>>         5 16 831 815 651.796 676 0.0773545 0.0973635
>>>>>>>>         6 15 994 979 652.479 656 0.112323 0.096901
>>>>>>>>         7 16 1164 1148 655.826 676 0.107592 0.0969845
>>>>>>>>         8 16 1327 1311 655.335 652 0.0960067 0.0968445
>>>>>>>>         9 16 1488 1472 654.066 644 0.0780589 0.0970879
>>>>>>>>
>>>>>>>>    .....
>>>>>>>>       297 16 43445 43429 584.811 596 0.0569516 0.109399
>>>>>>>>       298 16 43601 43585 584.942 624 0.0707439 0.109388
>>>>>>>>       299 16 43756 43740 585.059 620 0.20408 0.109363
>>>>>>>>    2015-10-15 14:16:59.622610 min lat: 0.0109677 max lat: 0.951389 avg lat: 0.109344
>>>>>>>>       sec Cur ops started finished avg MB/s cur MB/s last lat avg lat
>>>>>>>>       300 13 43901 43888 585.082 592 0.0768806 0.109344
>>>>>>>>     Total time run: 300.329089
>>>>>>>>    Total reads made: 43901
>>>>>>>>    Read size: 4194304
>>>>>>>>    Bandwidth (MB/sec): 584.705
>>>>>>>>
>>>>>>>>    Average Latency: 0.109407
>>>>>>>>    Max latency: 0.951389
>>>>>>>>    Min latency: 0.0109677
>>>>>>>>
>>>>>>>>    But the real write speed is very low:
>>>>>>>>
>>>>>>>>    # dd if=/dev/zero|pv|dd oflag=direct of=44444 bs=4k count=10k
>>>>>>>>    10240+0 records in1.5MiB/s] [ <=> ]
>>>>>>>>    10240+0 records out
>>>>>>>>    41943040 bytes (42 MB) copied, 25.9155 s, 1.6 MB/s
>>>>>>>>    40.1MiB 0:00:25 [1.55MiB/s] [ <=> ]
>>>>>>>>
>>>>>>>>    # dd if=/dev/zero|pv|dd oflag=direct of=44444 bs=32k count=10k
>>>>>>>>    10240+0 records in0.5MiB/s] [ <=> ]
>>>>>>>>    10240+0 records out
>>>>>>>>    335544320 bytes (336 MB) copied, 28.2998 s, 11.9 MB/s
>>>>>>>>     320MiB 0:00:28 [11.3MiB/s] [ <=> ]
>>>>>>>
>>>>>>>   So what happens if you continue increasing the 'bs' parameter? Is
>>>>>>>   bs=1M nice and fast?
>>>>>>>
>>>>>>>   John
>>>>>>>
>>>>>>>>    Do you know the root cause of the low write speed to the FS?
>>>>>>>>
>>>>>>>>    Thank you in advance for your help!
>>>>>>>>
>>>>>>>>    --
>>>>>>>>    Best Regards,
>>>>>>>>    Stanislav Butkeev
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


