Re: Low speed of write to cephfs

Yes, GPFS uses the "pagepool" option for caching IO.
But my cluster currently uses only a tiny slice of memory for caching, so we can consider that this cluster effectively runs without a cache.
# mmlsconfig
Configuration data for cluster XXXXXX:
---------------------------------------------
myNodeConfigNumber 3
clusterName ebs.ak315t.c2
clusterId 17642399629499555593
autoload no
pagepool 1k
dmapiFileHandleSize 32
minReleaseLevel 3.5.0.11
verbsPorts mlx4_0/1 mlx4_0/2
verbsRdma enable
adminMode central

File systems in cluster XXXXXX:
--------------------------------------
/dev/XXXXX
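
For reference, a value like that is set roughly as follows (a sketch: -i is meant to apply the change immediately, and GPFS may round a value this small up to its internal minimum, so this is not a guarantee that caching is fully off):
# mmchconfig pagepool=1k -i
# mmlsconfig pagepool
pagepool 1k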

-- 
Best Regards,
Stanislav Butkeev


15.10.2015, 23:05, "John Spray" <jspray@xxxxxxxxxx>:
> On Thu, Oct 15, 2015 at 8:46 PM, Butkeev Stas <staerist@xxxxx> wrote:
>>  Thank you for your comment. I know what the oflag=direct option means, and I am familiar with stress testing.
>>  Unfortunately, the write speed on this cluster FS is very slow.
>>
>>  The same test on another cluster FS (GPFS), which consists of 4 disks:
>>
>>  # dd if=/dev/zero|pv|dd oflag=direct of=99999 bs=4k count=10k
>>  40.1MB 0:00:05 [7.57MB/s] [ <=> ]
>>  10240+0 records in
>>  10240+0 records out
>>  41943040 bytes (42 MB) copied, 5.27816 s, 7.9 MB/s
>>
>>  I hope that I have missed some option during configuration, or something else.
>
> I don't know much about GPFS internals, since it's proprietary, but a
> quick google brings us here:
> http://www-01.ibm.com/support/knowledgecenter/SSFKCN_4.1.0.4/com.ibm.cluster.gpfs.v4r104.gpfs100.doc/bl1adm_considerations_direct_io.htm
>
> It appears that GPFS only respects O_DIRECT in certain circumstances,
> and in some cases will use its "pagepool" cache even when direct IO
> is requested. You would probably need to check with IBM to work out
> whether true direct IO is actually happening when you run on GPFS.
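>
> A rough way to probe that from the shell (a sketch; "testfile" is just
> a scratch path on the GPFS mount, numbers will vary):
>
> # dd if=/dev/zero of=testfile bs=4k count=10k conv=fdatasync
> # dd if=/dev/zero of=testfile bs=4k count=10k oflag=direct
>
> If the oflag=direct run is nearly as fast as the buffered+fdatasync run
> on GPFS but dramatically slower on cephfs, the pagepool is almost
> certainly absorbing the "direct" writes.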
>
> John
>
>>  --
>>  Best Regards,
>>  Stanislav Butkeev
>>
>>  15.10.2015, 22:36, "John Spray" <jspray@xxxxxxxxxx>:
>>>  On Thu, Oct 15, 2015 at 8:17 PM, Butkeev Stas <staerist@xxxxx> wrote:
>>>>   Hello John
>>>>
>>>>   Yes, of course, the write speed rises, because we are increasing the amount of data per disk operation.
>>>>   But do you know of even one piece of software that writes data in 1 MB blocks? I don't, and I suspect you don't either.
>>>
>>>  Plenty of applications do large writes, especially if they're intended
>>>  for use on network filesystems.
>>>
>>>  When you pass oflag=direct, you are asking the kernel to send these
>>>  writes individually instead of aggregating them in the page cache.
>>>  What you're measuring here is effectively the issue rate of small
>>>  messages, rather than the speed at which data can be written to ceph.
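>>>
>>>  To see that effect directly, you can sweep the block size (a minimal
>>>  sketch; "testfile" is just a scratch path on the cephfs mount):
>>>
>>>  # for bs in 4k 32k 256k 1M 4M; do dd if=/dev/zero of=testfile bs=$bs count=64 oflag=direct 2>&1 | tail -1; done
>>>
>>>  Throughput should rise roughly in proportion to the block size until
>>>  the network or the OSDs saturate, because each direct write costs at
>>>  least one round trip to the cluster.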
>>>
>>>  Try the same benchmark with NFS; you'll get similar scaling with block size.
>>>
>>>  Cheers,
>>>  John
>>>
>>>  If you want to aggregate these writes in the page cache before sending
>>>  them over the network, I imagine you probably need to disable direct
>>>  IO.
>>>
>>>>   A simple test: dd to an ordinary 2 TB SATA disk
>>>>
>>>>   # dd if=/dev/zero|pv|dd oflag=direct of=/dev/sdi bs=4k count=1M
>>>>      4GiB 0:00:46 [87.2MiB/s] [ <=> ]
>>>>   1048576+0 records in
>>>>   1048576+0 records out
>>>>   4294967296 bytes (4.3 GB) copied, 46.9688 s, 91.4 MB/s
>>>>
>>>>   # dd if=/dev/zero|pv|dd oflag=direct of=/dev/sdi bs=32k count=10k
>>>>   dd: warning: partial read (24576 bytes); suggest iflag=fullblock
>>>>    319MiB 0:00:03 [ 103MiB/s] [ <=> ]
>>>>   10219+21 records in
>>>>   10219+21 records out
>>>>   335262720 bytes (335 MB) copied, 3.15001 s, 106 MB/s
>>>>
>>>>   A single SATA disk achieves a better rate than cephfs, which consists of 24 of the same disks.
>>>>
>>>>   --
>>>>   Best Regards,
>>>>   Stanislav Butkeev
>>>>
>>>>   15.10.2015, 21:49, "John Spray" <jspray@xxxxxxxxxx>:
>>>>>   On Thu, Oct 15, 2015 at 5:11 PM, Butkeev Stas <staerist@xxxxx> wrote:
>>>>>>    Hello all,
>>>>>>    Has anybody tried using cephfs?
>>>>>>
>>>>>>    I have two servers with RHEL 7.1 (latest kernel 3.10.0-229.14.1.el7.x86_64). Each server has 15 GB of flash for the ceph journal and 12 x 2 TB SATA disks for data.
>>>>>>    I have a 56 Gb/s InfiniBand (IPoIB) interconnect between the nodes.
>>>>>>
>>>>>>    Cluster version
>>>>>>    # ceph -v
>>>>>>    ceph version 0.94.3 (95cefea9fd9ab740263bf8bb4796fd864d9afe2b)
>>>>>>
>>>>>>    Cluster config
>>>>>>    # cat /etc/ceph/ceph.conf
>>>>>>    [global]
>>>>>>            auth service required = cephx
>>>>>>            auth client required = cephx
>>>>>>            auth cluster required = cephx
>>>>>>            fsid = 0f05deaf-ee6f-4342-b589-5ecf5527aa6f
>>>>>>            mon osd full ratio = .95
>>>>>>            mon osd nearfull ratio = .90
>>>>>>            osd pool default size = 2
>>>>>>            osd pool default min size = 1
>>>>>>            osd pool default pg num = 32
>>>>>>            osd pool default pgp num = 32
>>>>>>            max open files = 131072
>>>>>>            osd crush chooseleaf type = 1
>>>>>>    [mds]
>>>>>>
>>>>>>    [mds.a]
>>>>>>            host = ak34
>>>>>>
>>>>>>    [mon]
>>>>>>            mon_initial_members = a,b
>>>>>>
>>>>>>    [mon.a]
>>>>>>            host = ak34
>>>>>>            mon addr = 172.24.32.134:6789
>>>>>>
>>>>>>    [mon.b]
>>>>>>            host = ak35
>>>>>>            mon addr = 172.24.32.135:6789
>>>>>>
>>>>>>    [osd]
>>>>>>            osd journal size = 1000
>>>>>>
>>>>>>    [osd.0]
>>>>>>            osd uuid = b3b3cd37-8df5-4455-8104-006ddba2c443
>>>>>>            host = ak34
>>>>>>            public addr = 172.24.32.134
>>>>>>            osd journal = /CEPH_JOURNAL/osd/ceph-0/journal
>>>>>>    .....
>>>>>>
>>>>>>    Below is the cluster tree
>>>>>>    # ceph osd tree
>>>>>>    ID WEIGHT TYPE NAME UP/DOWN REWEIGHT PRIMARY-AFFINITY
>>>>>>    -1 45.75037 root default
>>>>>>    -2 45.75037 region RU
>>>>>>    -3 45.75037 datacenter ru-msk-ak48t
>>>>>>    -4 22.87518 host ak34
>>>>>>     0 1.90627 osd.0 up 1.00000 1.00000
>>>>>>     1 1.90627 osd.1 up 1.00000 1.00000
>>>>>>     2 1.90627 osd.2 up 1.00000 1.00000
>>>>>>     3 1.90627 osd.3 up 1.00000 1.00000
>>>>>>     4 1.90627 osd.4 up 1.00000 1.00000
>>>>>>     5 1.90627 osd.5 up 1.00000 1.00000
>>>>>>     6 1.90627 osd.6 up 1.00000 1.00000
>>>>>>     7 1.90627 osd.7 up 1.00000 1.00000
>>>>>>     8 1.90627 osd.8 up 1.00000 1.00000
>>>>>>     9 1.90627 osd.9 up 1.00000 1.00000
>>>>>>    10 1.90627 osd.10 up 1.00000 1.00000
>>>>>>    11 1.90627 osd.11 up 1.00000 1.00000
>>>>>>    -5 22.87518 host ak35
>>>>>>    12 1.90627 osd.12 up 1.00000 1.00000
>>>>>>    13 1.90627 osd.13 up 1.00000 1.00000
>>>>>>    14 1.90627 osd.14 up 1.00000 1.00000
>>>>>>    15 1.90627 osd.15 up 1.00000 1.00000
>>>>>>    16 1.90627 osd.16 up 1.00000 1.00000
>>>>>>    17 1.90627 osd.17 up 1.00000 1.00000
>>>>>>    18 1.90627 osd.18 up 1.00000 1.00000
>>>>>>    19 1.90627 osd.19 up 1.00000 1.00000
>>>>>>    20 1.90627 osd.20 up 1.00000 1.00000
>>>>>>    21 1.90627 osd.21 up 1.00000 1.00000
>>>>>>    22 1.90627 osd.22 up 1.00000 1.00000
>>>>>>    23 1.90627 osd.23 up 1.00000 1.00000
>>>>>>
>>>>>>    Status of cluster
>>>>>>    # ceph -s
>>>>>>        cluster 0f05deaf-ee6f-4342-b589-5ecf5527aa6f
>>>>>>         health HEALTH_OK
>>>>>>         monmap e1: 2 mons at {a=172.24.32.134:6789/0,b=172.24.32.135:6789/0}
>>>>>>                election epoch 10, quorum 0,1 a,b
>>>>>>         mdsmap e14: 1/1/1 up {0=a=up:active}
>>>>>>         osdmap e194: 24 osds: 24 up, 24 in
>>>>>>          pgmap v2305: 384 pgs, 3 pools, 271 GB data, 72288 objects
>>>>>>                545 GB used, 44132 GB / 44678 GB avail
>>>>>>                     384 active+clean
>>>>>>
>>>>>>    Pools for cephfs
>>>>>>    # ceph osd dump|grep pg
>>>>>>    pool 1 'cephfs_data' replicated size 2 min_size 1 crush_ruleset 0 object_hash rjenkins pg_num 256 pgp_num 256 last_change 154 flags hashpspool crash_replay_interval 45 stripe_width 0
>>>>>>    pool 2 'cephfs_metadata' replicated size 2 min_size 1 crush_ruleset 0 object_hash rjenkins pg_num 64 pgp_num 64 last_change 144 flags hashpspool stripe_width 0
>>>>>>
>>>>>>    Rados bench
>>>>>>    # rados bench -p cephfs_data 300 write --no-cleanup && rados bench -p cephfs_data 300 seq
>>>>>>     Maintaining 16 concurrent writes of 4194304 bytes for up to 300 seconds or 0 objects
>>>>>>     Object prefix: benchmark_data_XXXXXXXXXXXXXXXXXXXX_8108
>>>>>>       sec Cur ops started finished avg MB/s cur MB/s last lat avg lat
>>>>>>         0 0 0 0 0 0 - 0
>>>>>>         1 16 170 154 615.74 616 0.109984 0.0978277
>>>>>>         2 16 335 319 637.817 660 0.0623079 0.0985001
>>>>>>         3 16 496 480 639.852 644 0.0992808 0.0982317
>>>>>>         4 16 662 646 645.862 664 0.0683485 0.0980203
>>>>>>         5 16 831 815 651.796 676 0.0773545 0.0973635
>>>>>>         6 15 994 979 652.479 656 0.112323 0.096901
>>>>>>         7 16 1164 1148 655.826 676 0.107592 0.0969845
>>>>>>         8 16 1327 1311 655.335 652 0.0960067 0.0968445
>>>>>>         9 16 1488 1472 654.066 644 0.0780589 0.0970879
>>>>>>
>>>>>>    .....
>>>>>>       297 16 43445 43429 584.811 596 0.0569516 0.109399
>>>>>>       298 16 43601 43585 584.942 624 0.0707439 0.109388
>>>>>>       299 16 43756 43740 585.059 620 0.20408 0.109363
>>>>>>    2015-10-15 14:16:59.622610min lat: 0.0109677 max lat: 0.951389 avg lat: 0.109344
>>>>>>       sec Cur ops started finished avg MB/s cur MB/s last lat avg lat
>>>>>>       300 13 43901 43888 585.082 592 0.0768806 0.109344
>>>>>>     Total time run: 300.329089
>>>>>>    Total reads made: 43901
>>>>>>    Read size: 4194304
>>>>>>    Bandwidth (MB/sec): 584.705
>>>>>>
>>>>>>    Average Latency: 0.109407
>>>>>>    Max latency: 0.951389
>>>>>>    Min latency: 0.0109677
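>>>>>>
>>>>>>    (Note: this bench uses 4 MB objects and 16 concurrent writers; a
>>>>>>    closer analogue of the 4k dd test below would be something like
>>>>>>    "rados bench -p cephfs_data 60 write -b 4096 -t 1 --no-cleanup",
>>>>>>    where -b sets the op size in bytes and -t the concurrency.)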
>>>>>>
>>>>>>    But the real write speed is very low:
>>>>>>
>>>>>>    # dd if=/dev/zero|pv|dd oflag=direct of=44444 bs=4k count=10k
>>>>>>    10240+0 records in
>>>>>>    10240+0 records out
>>>>>>    41943040 bytes (42 MB) copied, 25.9155 s, 1.6 MB/s
>>>>>>    40.1MiB 0:00:25 [1.55MiB/s] [ <=> ]
>>>>>>
>>>>>>    # dd if=/dev/zero|pv|dd oflag=direct of=44444 bs=32k count=10k
>>>>>>    10240+0 records in
>>>>>>    10240+0 records out
>>>>>>    335544320 bytes (336 MB) copied, 28.2998 s, 11.9 MB/s
>>>>>>     320MiB 0:00:28 [11.3MiB/s] [ <=> ]
>>>>>
>>>>>   So what happens if you continue increasing the 'bs' parameter? Is
>>>>>   bs=1M nice and fast?
>>>>>
>>>>>   John
>>>>>
>>>>>>    Do you know the root cause of the low write speed to this FS?
>>>>>>
>>>>>>    Thank you in advance for your help!
>>>>>>
>>>>>>    --
>>>>>>    Best Regards,
>>>>>>    Stanislav Butkeev
>>>>>>    _______________________________________________
>>>>>>    ceph-users mailing list
>>>>>>    ceph-users@xxxxxxxxxxxxxx
>>>>>>    http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



