Hi Oliver,

the IPoIB network is not 56 Gbit, it's probably a lot less (20 Gbit or so). The ib_write_bw test is verbs/RDMA based. Do you have iperf tests between hosts, and if so, can you share those results?

stijn

> We are just getting started with our first Ceph cluster (Luminous 12.2.2) and doing some basic benchmarking.
>
> We have two pools:
> - cephfs_metadata, living on 4 SSD devices (each is a bluestore OSD, 240 GB) on 2 hosts (i.e. 2 SSDs each), set up as:
>   - replicated, min size 2, max size 4
>   - 128 PGs
> - cephfs_data, living on 6 hosts, each of which has the following setup:
>   - 32 HDD drives (4 TB), each of which is a bluestore OSD; the LSI controller to which they are attached is in JBOD personality
>   - 2 SSD drives, each with 16 partitions of 7 GB, used as block-db by the bluestore OSDs living on the HDDs.
>   - Created with:
>       ceph osd erasure-code-profile set cephfs_data k=4 m=2 crush-device-class=hdd crush-failure-domain=host
>       ceph osd pool create cephfs_data 2048 2048 erasure cephfs_data
> - So to summarize: 192 OSDs, 2048 PGs, each OSD has 4 TB data + 7 GB block-db.
>
> The interconnect (public and cluster network) is made via IP over InfiniBand (56 Gbit bandwidth), using the software stack that comes with CentOS 7.
>
> This leaves us with the possibility that one of the metadata hosts can fail, and still one of the disks can fail.
> For the data hosts, up to two machines in total can fail.
>
> We have 40 clients connected to this cluster. We now run something like:
>   dd if=/dev/zero of=some_file bs=1M count=10000
> on each CPU core of each of the clients, yielding a total of 1120 writing processes (all 40 clients have 28+28HT cores), using the ceph-fuse client.
>
> This yields a write throughput of a bit below 1 GB/s (capital B), which is unexpectedly low.
> Running BeeGFS on the same cluster before (disks were in RAID 6 in that case) yielded throughputs of about 12 GB/s, but came with other issues (e.g. it's not FOSS...), so we'd love to run Ceph :-).
>
> I performed some basic tests to try to understand the bottleneck for Ceph:
> # rados bench -p cephfs_data 10 write --no-cleanup -t 40
>   Bandwidth (MB/sec):     695.952
>   Stddev Bandwidth:       295.223
>   Max bandwidth (MB/sec): 1088
>   Min bandwidth (MB/sec): 76
>   Average IOPS:           173
>   Stddev IOPS:            73
>   Max IOPS:               272
>   Min IOPS:               19
>   Average Latency(s):     0.220967
>   Stddev Latency(s):      0.305967
>   Max latency(s):         2.88931
>   Min latency(s):         0.0741061
>
> => This agrees mostly with our basic dd benchmark.
>
> Reading is a bit faster:
> # rados bench -p cephfs_data 10 rand
> => Bandwidth (MB/sec): 1108.75
>
> However, the disks are reasonably quick:
> # ceph tell osd.0 bench
> {
>     "bytes_written": 1073741824,
>     "blocksize": 4194304,
>     "bytes_per_sec": 331850403
> }
>
> I checked and the OSD hosts peaked at a load average of about 22 (they have 24+24HT cores) in our dd benchmark, but stayed well below that (only about 20 % per OSD daemon) in the rados bench test.
> One idea would be to switch from jerasure to ISA, since the machines all have Intel CPUs anyway.
>
> Already tried:
> - TCP stack tuning (wmem, rmem), no huge effect.
> - Changing the block sizes used by dd, no effect.
> - Testing network throughput with ib_write_bw, which revealed something like:
>
>   #bytes  #iterations  BW peak[MB/sec]  BW average[MB/sec]  MsgRate[Mpps]
>   2       5000         19.73            19.30               10.118121
>   4       5000         52.79            51.70               13.553412
>   8       5000         101.23           96.65               12.668371
>   16      5000         243.66           233.42              15.297583
>   32      5000         350.66           344.73              11.296089
>   64      5000         909.14           324.85              5.322323
>   128     5000         1424.84          1401.29             11.479374
>   256     5000         2865.24          2801.04             11.473055
>   512     5000         5169.98          5095.08             10.434733
>   1024    5000         10022.75         9791.42             10.026410
>   2048    5000         10988.64         10628.83            5.441958
>   4096    5000         11401.40         11399.14            2.918180
>   [...]
>
> So it seems the IP-over-InfiniBand is not the bottleneck (BeeGFS was using RDMA).
>
> Other ideas that come to mind:
> - Testing with Ceph-RDMA, but that does not seem production-ready yet, if I read the list correctly.
> - Increasing osd_pool_erasure_code_stripe_width.
> - Using ISA as EC plugin.
> - Reducing bluestore_cache_size_hdd; it seems that when recovery + benchmark are ongoing, swap is used (but not when benchmarking only, so this should not explain the slowdown).
>
> However, since we are just beginning with Ceph, it may well be that we are missing something basic but crucial here.
> For example, could it be that the block-db storage is too small? How to find out?
>
> Do any ideas come to mind?
>
> A second, hopefully easier question:
> If one OSD host fails in our setup, all PGs are changed to "active+clean+remapped" and lots of data is moved.
> I understand the remapping is needed, but why is data actually moved? With k=4 and m=2, failure domain=host, and 6 hosts of which one is down, there should be no advantage for redundancy in moving data around after one host has gone down - or am I missing something here?
>
> Cheers and many thanks in advance,
> Oliver
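
For the iperf comparison: a plain point-to-point run over the IPoIB interfaces of two OSD hosts is usually the quickest check. A minimal sketch, assuming iperf3 is installed on both ends (the address below is just a placeholder for the server's IPoIB IP):

    # on one OSD host (server side)
    iperf3 -s
    # on a second host (client side): 30 seconds, 4 parallel streams
    iperf3 -c 192.0.2.10 -t 30 -P 4

If that lands far below the ib_write_bw numbers, the IPoIB layer itself (MTU, datagram vs. connected mode) is worth tuning before digging further into Ceph.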
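On the ISA idea: as far as I know the erasure-code profile of an existing pool cannot be changed in place, so a comparison would mean creating a second profile and pool. A rough sketch with hypothetical names (cephfs_data_isa), mirroring the existing profile:

    ceph osd erasure-code-profile set cephfs_data_isa k=4 m=2 plugin=isa \
        crush-device-class=hdd crush-failure-domain=host
    # use fewer PGs here if you only want a short-lived test pool
    ceph osd pool create cephfs_data_isa 2048 2048 erasure cephfs_data_isa
    rados bench -p cephfs_data_isa 10 write --no-cleanup -t 40

That at least gives a like-for-like rados bench number for jerasure vs. ISA.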
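Regarding "is the block-db too small, and how to find out": the BlueFS perf counters should show whether RocksDB has spilled over from the SSD partition onto the HDD. Something along these lines, taking osd.0 as an example (run on the host carrying that OSD):

    ceph daemon osd.0 perf dump bluefs
    # compare db_used_bytes against db_total_bytes;
    # a non-zero slow_used_bytes means the DB has spilled onto the slow (HDD) device

With only 7 GB per OSD, that spill-over is worth checking once some data and a few benchmarks have gone through.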
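On the second question: the bulk data movement only starts once the down OSDs get marked out (after mon_osd_down_out_interval, 10 minutes by default), at which point CRUSH recomputes placements against the remaining hosts. If a host outage is planned or expected to be short, setting noout beforehand avoids the reshuffle:

    ceph osd set noout      # before taking the host down / as soon as it fails
    ceph osd unset noout    # once the host and its OSDs are back

That does not really answer why CRUSH moves data when only 5 hosts remain for 6 shards, though - hopefully someone with more EC experience can comment on that part.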