Hi Oliver,

the IPoIB network is not 56 Gbit, it's probably a lot less (20 Gbit or so). The ib_write_bw test is verbs/RDMA based. Do you have iperf tests between hosts, and if so, can you share those results?

stijn

> We are just getting started with our first Ceph cluster (Luminous 12.2.2) and doing some basic benchmarking.
>
> We have two pools:
> - cephfs_metadata, living on 4 SSD devices (each is a bluestore OSD, 240 GB) on 2 hosts (i.e. 2 SSDs each), set up as:
>   - replicated, min size 2, max size 4
>   - 128 PGs
> - cephfs_data, living on 6 hosts, each of which has the following setup:
>   - 32 HDD drives (4 TB), each of which is a bluestore OSD; the LSI controller to which they are attached is in JBOD personality
>   - 2 SSD drives, each with 16 partitions of 7 GB, used as block-db by the bluestore OSDs living on the HDDs.
>   - Created with:
>       ceph osd erasure-code-profile set cephfs_data k=4 m=2 crush-device-class=hdd crush-failure-domain=host
>       ceph osd pool create cephfs_data 2048 2048 erasure cephfs_data
> - So to summarize: 192 OSDs, 2048 PGs, each OSD has 4 TB data + 7 GB block-db.
>
> The interconnect (public and cluster network) is made via IP over InfiniBand (56 Gbit bandwidth), using the software stack that comes with CentOS 7.
>
> This leaves us with the possibility that one of the metadata hosts can fail, and still one of the disks can fail.
> For the data hosts, up to two machines in total can fail.
>
> We have 40 clients connected to this cluster. We now run something like:
>   dd if=/dev/zero of=some_file bs=1M count=10000
> on each CPU core of each of the clients, yielding a total of 1120 writing processes (all 40 clients have 28+28HT cores), using the ceph-fuse client.
>
> This yields a write throughput of a bit below 1 GB/s (capital B), which is unexpectedly low.
> Running BeeGFS on the same cluster before (disks were in RAID 6 in that case) yielded throughputs of about 12 GB/s, but came with other issues (e.g. it's not FOSS...), so we'd love to run Ceph :-).
>
> I performed some basic tests to try to understand the bottleneck for Ceph:
> # rados bench -p cephfs_data 10 write --no-cleanup -t 40
>   Bandwidth (MB/sec):     695.952
>   Stddev Bandwidth:       295.223
>   Max bandwidth (MB/sec): 1088
>   Min bandwidth (MB/sec): 76
>   Average IOPS:           173
>   Stddev IOPS:            73
>   Max IOPS:               272
>   Min IOPS:               19
>   Average Latency(s):     0.220967
>   Stddev Latency(s):      0.305967
>   Max latency(s):         2.88931
>   Min latency(s):         0.0741061
>
> => This agrees mostly with our basic dd benchmark.
>
> Reading is a bit faster:
> # rados bench -p cephfs_data 10 rand
> => Bandwidth (MB/sec): 1108.75
>
> However, the disks are reasonably quick:
> # ceph tell osd.0 bench
> {
>     "bytes_written": 1073741824,
>     "blocksize": 4194304,
>     "bytes_per_sec": 331850403
> }
>
> I checked and the OSD hosts peaked at a load average of about 22 (they have 24+24HT cores) in our dd benchmark, but stayed well below that (only about 20 % per OSD daemon) in the rados bench test.
> One idea would be to switch from jerasure to ISA, since the machines all have Intel CPUs anyway.
>
> Already tried:
> - TCP stack tuning (wmem, rmem), no huge effect.
> - Changing the block sizes used by dd, no effect.
> - Testing network throughput with ib_write_bw, which revealed something like:
>
>   #bytes  #iterations  BW peak[MB/sec]  BW average[MB/sec]  MsgRate[Mpps]
>   2       5000         19.73            19.30               10.118121
>   4       5000         52.79            51.70               13.553412
>   8       5000         101.23           96.65               12.668371
>   16      5000         243.66           233.42              15.297583
>   32      5000         350.66           344.73              11.296089
>   64      5000         909.14           324.85              5.322323
>   128     5000         1424.84          1401.29             11.479374
>   256     5000         2865.24          2801.04             11.473055
>   512     5000         5169.98          5095.08             10.434733
>   1024    5000         10022.75         9791.42             10.026410
>   2048    5000         10988.64         10628.83            5.441958
>   4096    5000         11401.40         11399.14            2.918180
>   [...]
>
> So it seems the IP-over-InfiniBand is not the bottleneck (BeeGFS was using RDMA).
>
> Other ideas that come to mind:
> - Testing with Ceph-RDMA, but that does not seem production-ready yet, if I read the list correctly.
> - Increasing osd_pool_erasure_code_stripe_width.
> - Using ISA as EC plugin.
> - Reducing bluestore_cache_size_hdd; it seems that when recovery + benchmark are ongoing, swap is used (but not when benchmarking only, so this should not explain the slowdown).
>
> However, since we are just beginning with Ceph, it may well be that we are missing something basic but crucial here.
> For example, could it be that the block-db storage is too small? How to find out?
>
> Do any ideas come to mind?
>
> A second, hopefully easier question:
> If one OSD host fails in our setup, all PGs are changed to "active+clean+remapped" and lots of data is moved.
> I understand the remapping is needed, but why is data actually moved? With k=4 and m=2, failure domain=host, and 6 hosts of which one is down, there should be no advantage for redundancy in moving data around after one host has gone down - or am I missing something here?
>
> Cheers and many thanks in advance,
> Oliver
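
For the iperf comparison: a plain point-to-point run over the IPoIB interfaces of two OSD hosts is usually the quickest check. A minimal sketch, assuming iperf3 is installed on both ends (the address below is just a placeholder for the server's IPoIB IP):

    # on one OSD host (server side)
    iperf3 -s
    # on a second host (client side): 30 seconds, 4 parallel streams
    iperf3 -c 192.0.2.10 -t 30 -P 4

If that lands far below the ib_write_bw numbers, the IPoIB layer itself (MTU, datagram vs. connected mode) is worth tuning before digging further into Ceph.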
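On the ISA idea: as far as I know the erasure-code profile of an existing pool cannot be changed in place, so a comparison would mean creating a second profile and pool. A rough sketch with hypothetical names (cephfs_data_isa), mirroring the existing profile:

    ceph osd erasure-code-profile set cephfs_data_isa k=4 m=2 plugin=isa \
        crush-device-class=hdd crush-failure-domain=host
    # use fewer PGs here if you only want a short-lived test pool
    ceph osd pool create cephfs_data_isa 2048 2048 erasure cephfs_data_isa
    rados bench -p cephfs_data_isa 10 write --no-cleanup -t 40

That at least gives a like-for-like rados bench number for jerasure vs. ISA.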
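Regarding "is the block-db too small, and how to find out": the BlueFS perf counters should show whether RocksDB has spilled over from the SSD partition onto the HDD. Something along these lines, taking osd.0 as an example (run on the host carrying that OSD):

    ceph daemon osd.0 perf dump bluefs
    # compare db_used_bytes against db_total_bytes;
    # a non-zero slow_used_bytes means the DB has spilled onto the slow (HDD) device

With only 7 GB per OSD, that spill-over is worth checking once some data and a few benchmarks have gone through.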
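On the second question: the bulk data movement only starts once the down OSDs get marked out (after mon_osd_down_out_interval, 10 minutes by default), at which point CRUSH recomputes placements against the remaining hosts. If a host outage is planned or expected to be short, setting noout beforehand avoids the reshuffle:

    ceph osd set noout      # before taking the host down / as soon as it fails
    ceph osd unset noout    # once the host and its OSDs are back

That does not really answer why CRUSH moves data when only 5 hosts remain for 6 shards, though - hopefully someone with more EC experience can comment on that part.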