Hello Max,

It is a 15 GB SCSI disk exported from a flash array to the server.

# multipath -ll
XXXXXXXXXXXXXXXXXXXXXXXXX dm-3 XXXXXXXXXX
size=15G features='0' hwhandler='0' wp=rw
`-+- policy='round-robin 0' prio=1 status=active
  |- 5:0:0:2 sdp 8:240  active ready running
  |- 4:0:0:2 sdq 65:0   active ready running
  |- 6:0:0:2 sds 65:32  active ready running
  `- 7:0:0:2 sdu 65:64  active ready running

In the config you can see the option "osd journal size = 1000", i.e. 1000 MB per OSD journal. I use 12 GB on each node for the Ceph journals. For example:

# ls -l /CEPH_JOURNAL/*/*
/CEPH_JOURNAL/osd/ceph-0:
total 1024000
-rw-r--r-- 1 root root 1048576000 Oct 15 19:03 journal

/CEPH_JOURNAL/osd/ceph-1:
total 1024000
-rw-r--r-- 1 root root 1048576000 Oct 15 19:03 journal

/CEPH_JOURNAL/osd/ceph-10:
total 1024000
-rw-r--r-- 1 root root 1048576000 Oct 15 19:04 journal

/CEPH_JOURNAL/osd/ceph-11:
total 1024000
-rw-r--r-- 1 root root 1048576000 Oct 15 19:03 journal

/CEPH_JOURNAL/osd/ceph-2:
total 1024000
-rw-r--r-- 1 root root 1048576000 Oct 15 19:03 journal

/CEPH_JOURNAL/osd/ceph-3:
total 1024000
-rw-r--r-- 1 root root 1048576000 Oct 15 19:03 journal
.......

--
Best Regards,
Stanislav Butkeev

15.10.2015, 23:26, "Max Yehorov" <myehorov@xxxxxxxxxx>:
> Stas,
>
> as you said: "Each server has 15G flash for ceph journal and 12*2Tb
> SATA disk for"
>
> What is this 15G flash and is it used for all 12 SATA drives?
>
> On Thu, Oct 15, 2015 at 1:05 PM, John Spray <jspray@xxxxxxxxxx> wrote:
>> On Thu, Oct 15, 2015 at 8:46 PM, Butkeev Stas <staerist@xxxxx> wrote:
>>> Thank you for your comment. I know what the oflag=direct option means, and the other things about stress testing.
>>> Unfortunately, the speed of this cluster FS is very slow.
>>>
>>> The same test on another cluster FS (GPFS), which consists of 4 disks:
>>>
>>> # dd if=/dev/zero|pv|dd oflag=direct of=99999 bs=4k count=10k
>>> 40.1MB 0:00:05 [7.57MB/s] [ <=> ]
>>> 10240+0 records in
>>> 10240+0 records out
>>> 41943040 bytes (42 MB) copied, 5.27816 s, 7.9 MB/s
>>>
>>> I hope that I am missing some options in the configuration, or something else.
>>
>> I don't know much about GPFS internals, since it's proprietary, but a
>> quick google brings us here:
>> http://www-01.ibm.com/support/knowledgecenter/SSFKCN_4.1.0.4/com.ibm.cluster.gpfs.v4r104.gpfs100.doc/bl1adm_considerations_direct_io.htm
>>
>> It appears that GPFS only respects O_DIRECT in certain circumstances,
>> and in some circumstances will use their "pagepool" cache even when
>> direct IO is requested. You would probably need to check with IBM to
>> work out exactly whether true direct IO is happening when you run on
>> GPFS.
>>
>> John
>>
>>> --
>>> Best Regards,
>>> Stanislav Butkeev
>>>
>>> 15.10.2015, 22:36, "John Spray" <jspray@xxxxxxxxxx>:
>>>> On Thu, Oct 15, 2015 at 8:17 PM, Butkeev Stas <staerist@xxxxx> wrote:
>>>>> Hello John
>>>>>
>>>>> Yes, of course, the write speed rises, because we are increasing the amount of data per disk operation.
>>>>> But do you know of even one piece of software that writes data in 1 MB blocks? I don't know of one, and I doubt you do either.
>>>>
>>>> Plenty of applications do large writes, especially if they're intended
>>>> for use on network filesystems.
>>>>
>>>> When you pass oflag=direct, you are asking the kernel to send these
>>>> writes individually instead of aggregating them in the page cache.
>>>> What you're measuring here is effectively the issue rate of small
>>>> messages, rather than the speed at which data can be written to ceph.
>>>>
>>>> Try the same benchmark with NFS; you'll get similar scaling with block size.
>>>>
>>>> Cheers,
>>>> John
>>>>
>>>> If you want to aggregate these writes in the page cache before sending
>>>> them over the network, I imagine you probably need to disable direct
>>>> IO.
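As an illustration of the point above (the mount point /mnt/cephfs and the file names are assumptions, not paths taken from this thread), the same dd test can be run in three variants against the CephFS mount:

# dd if=/dev/zero of=/mnt/cephfs/test1 bs=4k count=10k oflag=direct
# dd if=/dev/zero of=/mnt/cephfs/test2 bs=1M count=1k oflag=direct
# dd if=/dev/zero of=/mnt/cephfs/test3 bs=4k count=10k conv=fdatasync

The first measures the round-trip rate of small synchronous writes, the second issues large direct writes that should land much closer to the rados bench figures further down, and the third lets the page cache aggregate the 4k writes and only flushes them once at the end.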
>>>>> A simple test: dd to an ordinary 2 TB SATA disk
>>>>>
>>>>> # dd if=/dev/zero|pv|dd oflag=direct of=/dev/sdi bs=4k count=1M
>>>>> 4GiB 0:00:46 [87.2MiB/s] [ <=> ]
>>>>> 1048576+0 records in
>>>>> 1048576+0 records out
>>>>> 4294967296 bytes (4.3 GB) copied, 46.9688 s, 91.4 MB/s
>>>>>
>>>>> # dd if=/dev/zero|pv|dd oflag=direct of=/dev/sdi bs=32k count=10k
>>>>> dd: warning: partial read (24576 bytes); suggest iflag=fullblock
>>>>> 319MiB 0:00:03 [ 103MiB/s] [ <=> ]
>>>>> 10219+21 records in
>>>>> 10219+21 records out
>>>>> 335262720 bytes (335 MB) copied, 3.15001 s, 106 MB/s
>>>>>
>>>>> One SATA disk has a better rate than CephFS, which consists of 24 of the same disks.
>>>>>
>>>>> --
>>>>> Best Regards,
>>>>> Stanislav Butkeev
>>>>>
>>>>> 15.10.2015, 21:49, "John Spray" <jspray@xxxxxxxxxx>:
>>>>>> On Thu, Oct 15, 2015 at 5:11 PM, Butkeev Stas <staerist@xxxxx> wrote:
>>>>>>> Hello all,
>>>>>>> Has anybody tried using CephFS?
>>>>>>>
>>>>>>> I have two servers with RHEL 7.1 (latest kernel 3.10.0-229.14.1.el7.x86_64). Each server has 15G flash for ceph journal and 12*2Tb SATA disks for data.
>>>>>>> I have a 56 Gb/s InfiniBand (IPoIB) interconnect between the nodes.
>>>>>>>
>>>>>>> Cluster version:
>>>>>>> # ceph -v
>>>>>>> ceph version 0.94.3 (95cefea9fd9ab740263bf8bb4796fd864d9afe2b)
>>>>>>>
>>>>>>> Cluster config:
>>>>>>> # cat /etc/ceph/ceph.conf
>>>>>>> [global]
>>>>>>> auth service required = cephx
>>>>>>> auth client required = cephx
>>>>>>> auth cluster required = cephx
>>>>>>> fsid = 0f05deaf-ee6f-4342-b589-5ecf5527aa6f
>>>>>>> mon osd full ratio = .95
>>>>>>> mon osd nearfull ratio = .90
>>>>>>> osd pool default size = 2
>>>>>>> osd pool default min size = 1
>>>>>>> osd pool default pg num = 32
>>>>>>> osd pool default pgp num = 32
>>>>>>> max open files = 131072
>>>>>>> osd crush chooseleaf type = 1
>>>>>>>
>>>>>>> [mds]
>>>>>>>
>>>>>>> [mds.a]
>>>>>>> host = ak34
>>>>>>>
>>>>>>> [mon]
>>>>>>> mon_initial_members = a,b
>>>>>>>
>>>>>>> [mon.a]
>>>>>>> host = ak34
>>>>>>> mon addr = 172.24.32.134:6789
>>>>>>>
>>>>>>> [mon.b]
>>>>>>> host = ak35
>>>>>>> mon addr = 172.24.32.135:6789
>>>>>>>
>>>>>>> [osd]
>>>>>>> osd journal size = 1000
>>>>>>>
>>>>>>> [osd.0]
>>>>>>> osd uuid = b3b3cd37-8df5-4455-8104-006ddba2c443
>>>>>>> host = ak34
>>>>>>> public addr = 172.24.32.134
>>>>>>> osd journal = /CEPH_JOURNAL/osd/ceph-0/journal
>>>>>>> .....
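A side note on the journal settings above: with "osd journal size = 1000" each journal is roughly 1000 MB, which matches the 1048576000-byte journal files listed at the top of this thread, and twelve of them account for the ~12 GB used on the 15 GB flash device. To confirm what a running OSD actually picked up, the admin socket can report it (assuming the socket is available in its default location on the OSD node), for example:

# ceph daemon osd.0 config get osd_journal_size
# ceph daemon osd.0 config show | grep journal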
>>>>>>>
>>>>>>> Below is the cluster tree:
>>>>>>> # ceph osd tree
>>>>>>> ID WEIGHT   TYPE NAME                      UP/DOWN REWEIGHT PRIMARY-AFFINITY
>>>>>>> -1 45.75037 root default
>>>>>>> -2 45.75037     region RU
>>>>>>> -3 45.75037         datacenter ru-msk-ak48t
>>>>>>> -4 22.87518             host ak34
>>>>>>>  0  1.90627                 osd.0               up  1.00000          1.00000
>>>>>>>  1  1.90627                 osd.1               up  1.00000          1.00000
>>>>>>>  2  1.90627                 osd.2               up  1.00000          1.00000
>>>>>>>  3  1.90627                 osd.3               up  1.00000          1.00000
>>>>>>>  4  1.90627                 osd.4               up  1.00000          1.00000
>>>>>>>  5  1.90627                 osd.5               up  1.00000          1.00000
>>>>>>>  6  1.90627                 osd.6               up  1.00000          1.00000
>>>>>>>  7  1.90627                 osd.7               up  1.00000          1.00000
>>>>>>>  8  1.90627                 osd.8               up  1.00000          1.00000
>>>>>>>  9  1.90627                 osd.9               up  1.00000          1.00000
>>>>>>> 10  1.90627                 osd.10              up  1.00000          1.00000
>>>>>>> 11  1.90627                 osd.11              up  1.00000          1.00000
>>>>>>> -5 22.87518             host ak35
>>>>>>> 12  1.90627                 osd.12              up  1.00000          1.00000
>>>>>>> 13  1.90627                 osd.13              up  1.00000          1.00000
>>>>>>> 14  1.90627                 osd.14              up  1.00000          1.00000
>>>>>>> 15  1.90627                 osd.15              up  1.00000          1.00000
>>>>>>> 16  1.90627                 osd.16              up  1.00000          1.00000
>>>>>>> 17  1.90627                 osd.17              up  1.00000          1.00000
>>>>>>> 18  1.90627                 osd.18              up  1.00000          1.00000
>>>>>>> 19  1.90627                 osd.19              up  1.00000          1.00000
>>>>>>> 20  1.90627                 osd.20              up  1.00000          1.00000
>>>>>>> 21  1.90627                 osd.21              up  1.00000          1.00000
>>>>>>> 22  1.90627                 osd.22              up  1.00000          1.00000
>>>>>>> 23  1.90627                 osd.23              up  1.00000          1.00000
>>>>>>>
>>>>>>> Cluster status:
>>>>>>> # ceph -s
>>>>>>>     cluster 0f05deaf-ee6f-4342-b589-5ecf5527aa6f
>>>>>>>      health HEALTH_OK
>>>>>>>      monmap e1: 2 mons at {a=172.24.32.134:6789/0,b=172.24.32.135:6789/0}
>>>>>>>             election epoch 10, quorum 0,1 a,b
>>>>>>>      mdsmap e14: 1/1/1 up {0=a=up:active}
>>>>>>>      osdmap e194: 24 osds: 24 up, 24 in
>>>>>>>       pgmap v2305: 384 pgs, 3 pools, 271 GB data, 72288 objects
>>>>>>>             545 GB used, 44132 GB / 44678 GB avail
>>>>>>>                  384 active+clean
>>>>>>>
>>>>>>> Pools for CephFS:
>>>>>>> # ceph osd dump|grep pg
>>>>>>> pool 1 'cephfs_data' replicated size 2 min_size 1 crush_ruleset 0 object_hash rjenkins pg_num 256 pgp_num 256 last_change 154 flags hashpspool crash_replay_interval 45 stripe_width 0
>>>>>>> pool 2 'cephfs_metadata' replicated size 2 min_size 1 crush_ruleset 0 object_hash rjenkins pg_num 64 pgp_num 64 last_change 144 flags hashpspool stripe_width 0
>>>>>>>
>>>>>>> Rados bench:
>>>>>>> # rados bench -p cephfs_data 300 write --no-cleanup && rados bench -p cephfs_data 300 seq
>>>>>>> Maintaining 16 concurrent writes of 4194304 bytes for up to 300 seconds or 0 objects
>>>>>>> Object prefix: benchmark_data_XXXXXXXXXXXXXXXXXXXX_8108
>>>>>>>   sec Cur ops   started  finished  avg MB/s  cur MB/s  last lat   avg lat
>>>>>>>     0       0         0         0         0         0         -         0
>>>>>>>     1      16       170       154    615.74       616  0.109984 0.0978277
>>>>>>>     2      16       335       319   637.817       660 0.0623079 0.0985001
>>>>>>>     3      16       496       480   639.852       644 0.0992808 0.0982317
>>>>>>>     4      16       662       646   645.862       664 0.0683485 0.0980203
>>>>>>>     5      16       831       815   651.796       676 0.0773545 0.0973635
>>>>>>>     6      15       994       979   652.479       656  0.112323  0.096901
>>>>>>>     7      16      1164      1148   655.826       676  0.107592 0.0969845
>>>>>>>     8      16      1327      1311   655.335       652 0.0960067 0.0968445
>>>>>>>     9      16      1488      1472   654.066       644 0.0780589 0.0970879
>>>>>>>
>>>>>>> .....
>>>>>>>   297      16     43445     43429   584.811       596 0.0569516  0.109399
>>>>>>>   298      16     43601     43585   584.942       624 0.0707439  0.109388
>>>>>>>   299      16     43756     43740   585.059       620   0.20408  0.109363
>>>>>>> 2015-10-15 14:16:59.622610 min lat: 0.0109677 max lat: 0.951389 avg lat: 0.109344
>>>>>>>   sec Cur ops   started  finished  avg MB/s  cur MB/s  last lat   avg lat
>>>>>>>   300      13     43901     43888   585.082       592 0.0768806  0.109344
>>>>>>> Total time run:         300.329089
>>>>>>> Total reads made:       43901
>>>>>>> Read size:              4194304
>>>>>>> Bandwidth (MB/sec):     584.705
>>>>>>>
>>>>>>> Average Latency:        0.109407
>>>>>>> Max latency:            0.951389
>>>>>>> Min latency:            0.0109677
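Worth noting: the bench above uses the defaults visible in its header, 4 MB objects and 16 writes in flight, which is a very different workload from a single stream of 4k direct writes. For a closer comparison with the dd runs below, rados bench can also be driven with a small op size and a single outstanding op; the parameters here are illustrative, not taken from this thread:

# rados bench -p cephfs_data 60 write -b 4096 -t 1 --no-cleanup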
>>>>>>>
>>>>>>> But the real write speed is very low:
>>>>>>>
>>>>>>> # dd if=/dev/zero|pv|dd oflag=direct of=44444 bs=4k count=10k
>>>>>>> 10240+0 records in1.5MiB/s] [ <=> ]
>>>>>>> 10240+0 records out
>>>>>>> 41943040 bytes (42 MB) copied, 25.9155 s, 1.6 MB/s
>>>>>>> 40.1MiB 0:00:25 [1.55MiB/s] [ <=> ]
>>>>>>>
>>>>>>> # dd if=/dev/zero|pv|dd oflag=direct of=44444 bs=32k count=10k
>>>>>>> 10240+0 records in0.5MiB/s] [ <=> ]
>>>>>>> 10240+0 records out
>>>>>>> 335544320 bytes (336 MB) copied, 28.2998 s, 11.9 MB/s
>>>>>>> 320MiB 0:00:28 [11.3MiB/s] [ <=> ]
>>>>>>
>>>>>> So what happens if you continue increasing the 'bs' parameter? Is
>>>>>> bs=1M nice and fast?
>>>>>>
>>>>>> John
>>>>>>
>>>>>>> Do you know the root cause of the low write speed to the FS?
>>>>>>>
>>>>>>> Thank you in advance for your help!
>>>>>>>
>>>>>>> --
>>>>>>> Best Regards,
>>>>>>> Stanislav Butkeev
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com