Hi Vadim,

many thanks for these benchmark results!

This indeed looks extremely similar to what we achieve after enabling connected mode.

Our 6 OSD hosts are Supermicro systems with 2 HDDs (RAID 1) for the OS, and 32 HDDs (4 TB) + 2 SSDs for the OSDs.
The 2 SSDs have 16 LVM volumes each (~6.7 GB per volume) to hold the Bluestore BlockDB for the 32 OSDs.

So in our case, we have 32 OSDs "behind" one IPoIB link, and the link clearly sets the limit.
Also, we are running an EC pool, so inter-OSD traffic for any read or write operation is heavy.

If I perform a test with "iperf -d" (i.e. send and receive in parallel), I sadly note that the ~20 GBit/s limit, which you also see, applies to the sum of both directions.
My expectation is that for you, too, the limit may well be set by the IPoIB link speed - the disks could probably go much faster, especially if you change to Bluestore.

Our workload, by the way, is also HPC - or maybe rather HTC (High Throughput Computing), but luckily our users are used to a significantly slower filesystem from the old cluster
and will likely not use all of the throughput we can already achieve with IPoIB.

Many thanks again for sharing your benchmarks!

Cheers,
Oliver

On 22.02.2018 at 13:15, Vadim Bulst wrote:
> Hi Oliver,
>
> I also use Infiniband and Cephfs for HPC purposes.
>
> My setup:
>
> * 4x Dell R730xd and expansion shelf, 24 OSDs à 8 TB, 128 GB RAM, 2x 10-core Intel 4th Gen, Mellanox ConnectX-3, no SSD cache
>
> * 7x Dell R630 clients
>
> * Ceph cluster running on Ubuntu Xenial and Ceph Jewel, deployed with Ceph-Ansible
> * Cephfs clients on Debian Stretch with the Cephfs kernel module
>
> * IPoverIB for public and cluster network; IB adapters are in connected mode and MTU is 65520
>
>
> Future improvements: moving the cephfs_metadata pool to an NVMe pool, update to Luminous and Bluestore
>
> root@polstor02:/home/urzadmin# ceph -s
>     cluster 7c4bfd06-046f-49e4-bb77-0402d7ca98e5
>      health HEALTH_OK
>      monmap e2: 3 mons at {polstor01=10.10.144.211:6789/0,polstor02=10.10.144.212:6789/0,polstor03=10.10.144.213:6789/0}
>             election epoch 5034, quorum 0,1,2 polstor01,polstor02,polstor03
>       fsmap e2091562: 1/1/1 up {0=polstor02=up:active}, 1 up:standby-replay, 1 up:standby
>      osdmap e2078945: 95 osds: 95 up, 95 in
>             flags sortbitwise,require_jewel_osds
>       pgmap v8638409: 4224 pgs, 2 pools, 93414 GB data, 34592 kobjects
>             274 TB used, 416 TB / 690 TB avail
>                 4221 active+clean
>                    3 active+clean+scrubbing+deep
>   client io 1658 B/s rd, 3 op/s rd, 0 op/s wr
>
>
> These are my measurements:
>
> ------------------------------------------------------------
> Server listening on TCP port 5001
> TCP window size: 85.3 KByte (default)
> ------------------------------------------------------------
> [  4] local 10.10.144.213 port 5001 connected with 10.10.144.212 port 42584
> [ ID] Interval       Transfer     Bandwidth
> [  4]  0.0-10.0 sec  27.2 GBytes  23.3 Gbits/sec
> [  5] local 10.10.144.213 port 5001 connected with 10.10.144.212 port 42586
> [  5]  0.0-10.0 sec  25.4 GBytes  21.8 Gbits/sec
> [  4] local 10.10.144.213 port 5001 connected with 10.10.144.212 port 42588
> [  4]  0.0-10.0 sec  19.9 GBytes  17.1 Gbits/sec
> [  5] local 10.10.144.213 port 5001 connected with 10.10.144.212 port 42590
> [  5]  0.0-10.0 sec  20.2 GBytes  17.3 Gbits/sec
> [  4] local 10.10.144.213 port 5001 connected with 10.10.144.212 port 42592
> [  4]  0.0-10.0 sec  30.2 GBytes  25.9 Gbits/sec
> [  5] local 10.10.144.213 port 5001 connected with 10.10.144.212 port 42594
> [  5]  0.0-10.0 sec  26.1 GBytes  22.4 Gbits/sec
>
> root@polstor02:/home/urzadmin# rados bench -p cephfs_data 10 write --no-cleanup -t 40
> Maintaining 40 concurrent writes of 4194304 bytes to objects of size 4194304 for up to 10 seconds or 0 objects
> Object prefix: benchmark_data_polstor02_3189601
>   sec Cur ops   started  finished  avg MB/s  cur MB/s last lat(s)  avg lat(s)
>     0       0         0         0         0         0           -           0
>     1      39       262       223   891.992       892   0.0952355    0.156985
>     2      39       497       458   915.934       940    0.129115    0.162122
>     3      39       675       636   847.921       712    0.557279    0.172988
>     4      39       857       818   817.921       728    0.154144    0.186755
>     5      39      1042      1003   802.315       740    0.135748    0.191932
>     6      39      1223      1184   789.248       724     0.13996    0.197136
>     7      39      1411      1372   783.912       752    0.204627    0.196429
>     8      39      1556      1517   758.414       580    0.253825    0.201344
>     9      39      1722      1683   747.916       664    0.175682    0.209318
>    10      39      1866      1827   730.715       576     0.37722    0.212927
> Total time run: 10.503421
> Total writes made: 1867
> Write size: 4194304
> Object size: 4194304
> Bandwidth (MB/sec): 711.006
> Stddev Bandwidth: 116.36
> Max bandwidth (MB/sec): 940
> Min bandwidth (MB/sec): 576
> Average IOPS: 177
> Stddev IOPS: 29
> Max IOPS: 235
> Min IOPS: 144
> Average Latency(s): 0.222746
> Stddev Latency(s): 0.160678
> Max latency(s): 2.68037
> Min latency(s): 0.0621196
>
>
>
> root@polstor02:/home/urzadmin# rados bench -p cephfs_data 10 rand
>   sec Cur ops   started  finished  avg MB/s  cur MB/s last lat(s)  avg lat(s)
>     0       0         0         0         0         0           -           0
>     1      15      1088      1073   4290.71      4292   0.0137212   0.0139589
>     2      15      2191      2176   4351.04      4412   0.0126225   0.0138207
>     3      15      3327      3312   4415.12      4544    0.013692   0.0136327
>     4      15      4498      4483    4482.1      4684   0.0103933   0.0134332
>     5      15      5677      5662   4528.77      4716   0.0115474   0.0132968
>     6      15      6836      6821    4546.5      4636   0.0147042   0.0132476
>     7      15      7967      7952   4543.19      4524   0.0138084   0.0132329
>     8      15      9152      9137   4567.71      4740   0.0150901    0.013193
>     9      15     10276     10261   4559.68      4496   0.0126462   0.0132172
>    10      15     11424     11409   4562.83      4592   0.0139788   0.0132104
> Total time run: 10.020400
> Total reads made: 11424
> Read size: 4194304
> Object size: 4194304
> Bandwidth (MB/sec): 4560.3
> Average IOPS: 1140
> Stddev IOPS: 35
> Max IOPS: 1185
> Min IOPS: 1073
> Average Latency(s): 0.0132159
> Max latency(s): 0.316514
> Min latency(s): 0.00687372
>
> In terms of native RDMA/IB support - well, it would be really nice if the Ceph community pushed this feature. There is a big scientific community interested in using Ceph for HPC workloads.
>
> Cheers,
>
> Vadim
>
>
> On 02/18/2018 04:03 PM, Oliver Freyermuth wrote:
>> Dear Cephalopodians,
>>
>> we are just getting started with our first Ceph cluster (Luminous 12.2.2) and doing some basic benchmarking.
>>
>> We have two pools:
>> - cephfs_metadata, living on 4 SSD devices (each is a bluestore OSD, 240 GB) on 2 hosts (i.e. 2 SSDs each), setup as:
>>   - replicated, min size 2, max size 4
>>   - 128 PGs
>> - cephfs_data, living on 6 hosts each of which has the following setup:
>>   - 32 HDD drives (4 TB) each of which is a bluestore OSD, the LSI controller to which they are attached is in JBOD personality
>>   - 2 SSD drives, each has 16 partitions with 7 GB per partition, used as block-db by the bluestore OSDs living on the HDDs.
>>   - Created with:
>>     ceph osd erasure-code-profile set cephfs_data k=4 m=2 crush-device-class=hdd crush-failure-domain=host
>>     ceph osd pool create cephfs_data 2048 2048 erasure cephfs_data
>>   - So to summarize: 192 OSDs, 2048 PGs, each OSD has 4 TB data + 7 GB block-db
>>
>> The interconnect (public and cluster network) is made via IP over Infiniband (56 GBit bandwidth), using the software stack that comes with CentOS 7.
>>
>> This leaves us with the possibility that one of the metadata-hosts can fail, and still one of the disks can fail.
>> For the data hosts, up to two machines total can fail.
>>
>> We have 40 clients connected to this cluster.
>> We now run something like:
>>   dd if=/dev/zero of=some_file bs=1M count=10000
>> on each CPU core of each of the clients, yielding a total of 1120 writing processes (all 40 clients have 28+28HT cores),
>> using the ceph-fuse client.
>>
>> This yields a write throughput of a bit below 1 GB/s (capital B), which is unexpectedly low.
>> Running a BeeGFS on the same cluster before (disks were in RAID 6 in that case) yielded throughputs of about 12 GB/s,
>> but came with other issues (e.g. it's not FOSS...), so we'd love to run Ceph :-).
>>
>> I performed some basic tests to try to understand the bottleneck for Ceph:
>> # rados bench -p cephfs_data 10 write --no-cleanup -t 40
>> Bandwidth (MB/sec): 695.952
>> Stddev Bandwidth: 295.223
>> Max bandwidth (MB/sec): 1088
>> Min bandwidth (MB/sec): 76
>> Average IOPS: 173
>> Stddev IOPS: 73
>> Max IOPS: 272
>> Min IOPS: 19
>> Average Latency(s): 0.220967
>> Stddev Latency(s): 0.305967
>> Max latency(s): 2.88931
>> Min latency(s): 0.0741061
>>
>> => This agrees mostly with our basic dd benchmark.
>>
>> Reading is a bit faster:
>> # rados bench -p cephfs_data 10 rand
>> => Bandwidth (MB/sec): 1108.75
>>
>> However, the disks are reasonably quick:
>> # ceph tell osd.0 bench
>> {
>>     "bytes_written": 1073741824,
>>     "blocksize": 4194304,
>>     "bytes_per_sec": 331850403
>> }
>>
>> I checked and the OSD-hosts peaked at a load average of about 22 (they have 24+24HT cores) in our dd benchmark,
>> but stayed well below that (only about 20 % per OSD daemon) in the rados bench test.
>> One idea would be to switch from jerasure to ISA, since the machines are all Intel CPUs only anyways.
>>
>> Already tried:
>> - TCP stack tuning (wmem, rmem), no huge effect.
>> - changing the block sizes used by dd, no effect.
>> - Testing network throughput with ib_write_bw, this revealed something like:
>>  #bytes    #iterations   BW peak[MB/sec]   BW average[MB/sec]   MsgRate[Mpps]
>>  2         5000          19.73             19.30                10.118121
>>  4         5000          52.79             51.70                13.553412
>>  8         5000          101.23            96.65                12.668371
>>  16        5000          243.66            233.42               15.297583
>>  32        5000          350.66            344.73               11.296089
>>  64        5000          909.14            324.85               5.322323
>>  128       5000          1424.84           1401.29              11.479374
>>  256       5000          2865.24           2801.04              11.473055
>>  512       5000          5169.98           5095.08              10.434733
>>  1024      5000          10022.75          9791.42              10.026410
>>  2048      5000          10988.64          10628.83             5.441958
>>  4096      5000          11401.40          11399.14             2.918180
>> [...]
>>
>> So it seems the IP-over-Infiniband is not the bottleneck (BeeGFS was using RDMA).
>> Other ideas that come to mind:
>> - Testing with Ceph-RDMA, but that does not seem production-ready yet, if I read the list correctly.
>> - Increasing osd_pool_erasure_code_stripe_width.
>> - Using ISA as EC plugin.
>> - Reducing the bluestore_cache_size_hdd, it seems when recovery + benchmark is ongoing, swap is used (but not when performing benchmarking only,
>>   so this should not explain the slowdown).
>>
>> However, since we are just beginning with Ceph, it may well be we are missing something basic, but crucial here.
>> For example, could it be that the block-db storage is too small? How to find out?
>>
>> Do any ideas come to mind?
>>
>> A second, hopefully easier question:
>> If one OSD-host fails in our setup, all PGs are changed to "active+clean+remapped" and lots of data is moved.
>> I understand the remapping is needed, but why is data actually moved? With k=4 and m=2, failure domain=host,
>> and 6 hosts of which one is down, there should be no advantage for redundancy by moving data around after one host has gone down - or am I missing something here?
>>
>> Cheers and many thanks in advance,
>> Oliver
>>
>> _______________________________________________
>> ceph-users mailing list
>> ceph-users@xxxxxxxxxxxxxx
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
> --
> Vadim Bulst
>
> Universität Leipzig / URZ
> 04109 Leipzig, Augustusplatz 10
>
> phone: +49-341-97-33380
> mail: vadim.bulst@xxxxxxxxxxxxxx
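
The argument in the reply at the top, that 32 OSDs sit "behind" one IPoIB link and the link sets the limit, can be put into rough numbers. The following is only a minimal Python sketch, not a measurement: it takes the ~20 GBit/s "iperf -d" figure, the 32 OSDs per host and the "bytes_per_sec" value from "ceph tell osd.0 bench" from this thread, and assumes (none of this is stated in the thread) that 1 GBit = 1e9 bits, that the link is shared evenly between the OSDs of a host, and that the single-OSD bench result is representative of all 32 disks. The function names are made up for illustration.

```python
# Rough sanity check: even link share per OSD vs. what one OSD's disk can write.
# Measured inputs are taken from this thread; the sharing model is an assumption.

LINK_GBIT_PER_S = 20.0              # ~20 GBit/s seen with "iperf -d", sum of both directions
OSDS_PER_HOST = 32                  # HDD OSDs behind one IPoIB link
OSD_BENCH_BYTES_PER_S = 331850403   # "bytes_per_sec" reported by "ceph tell osd.0 bench"


def link_bytes_per_s(gbit_per_s: float) -> float:
    """Convert GBit/s to bytes/s (assumption: 1 GBit = 1e9 bits)."""
    return gbit_per_s * 1e9 / 8


def per_osd_link_share(gbit_per_s: float, osds: int) -> float:
    """Bytes/s available to each OSD if the link is shared evenly (assumption)."""
    return link_bytes_per_s(gbit_per_s) / osds


def disk_to_link_ratio(osd_bytes_per_s: float, gbit_per_s: float, osds: int) -> float:
    """How many times faster one disk is than its even share of the link."""
    return osd_bytes_per_s / per_osd_link_share(gbit_per_s, osds)


share = per_osd_link_share(LINK_GBIT_PER_S, OSDS_PER_HOST)
ratio = disk_to_link_ratio(OSD_BENCH_BYTES_PER_S, LINK_GBIT_PER_S, OSDS_PER_HOST)
print(f"link share per OSD: {share / 1e6:.1f} MB/s")
print(f"single-OSD bench:   {OSD_BENCH_BYTES_PER_S / 1e6:.1f} MB/s")
print(f"disk / link share:  {ratio:.1f}x")
```

Under these assumptions each OSD gets roughly 78 MB/s of link bandwidth, while the single-OSD bench reports roughly 332 MB/s, so a disk is a bit more than four times faster than its share of the link. The extra inter-OSD traffic of the EC pool is not modelled here and would only widen that gap.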