Re: Ceph Bluestore performance question

Caspar Smit <casparsmit@xxxxxxxxxxx> · Mon, 19 Feb 2018 10:53:26 +0100

"I checked and the OSD-hosts peaked at a load average of about 22 (they have 24+24HT cores) in our dd benchmark,
but stayed well below that (only about 20 % per OSD daemon) in the rados bench test."

Maybe because your dd test uses bs=1M and rados bench is using 4M as default block size?
Caspar

2018-02-18 16:03 GMT+01:00 Oliver Freyermuth <freyermuth@xxxxxxxxxxxxxxxxxx>:
Dear Cephalopodians,

we are just getting started with our first Ceph cluster (Luminous 12.2.2) and doing some basic benchmarking.

We have two pools:

- cephfs_metadata, living on 4 SSD devices (each is a bluestore OSD, 240 GB) on 2 hosts (i.e. 2 SSDs each), setup as:

  - replicated, min size 2, max size 4

  - 128 PGs

- cephfs_data,     living on 6 hosts each of which has the following setup:

  - 32 HDD drives (4 TB), each of which is a bluestore OSD; the LSI controller to which they are attached is in JBOD personality

  - 2 SSD drives, each has 16 partitions with 7 GB per partition, used as block-db by the bluestore OSDs living on the HDDs.

  - Created with:

    ceph osd erasure-code-profile set cephfs_data k=4 m=2 crush-device-class=hdd crush-failure-domain=host

    ceph osd pool create cephfs_data 2048 2048 erasure cephfs_data

  - So to summarize: 192 OSDs, 2048 PGs, each OSD has 4 TB data + 7 GB block-db
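As a quick cross-check of the summary, the layout can be recomputed from the figures listed above. This is only a sketch: the per-OSD shard count assumes PG shards are spread perfectly evenly (CRUSH only approximates this), and the usable-capacity figure ignores any BlueStore or filesystem overhead.

```python
# Cluster layout as described in the list above.
hosts = 6
hdds_per_host = 32
ssds_per_host = 2
db_partitions_per_ssd = 16
k, m = 4, 2              # erasure-code profile "cephfs_data"
pg_num = 2048
osd_size_tb = 4
db_size_gb = 7

osds = hosts * hdds_per_host                         # one OSD per HDD
db_partitions = ssds_per_host * db_partitions_per_ssd
shards_per_osd = pg_num * (k + m) / osds             # assumes even spread
raw_tb = osds * osd_size_tb
usable_tb = raw_tb * k / (k + m)                     # ignores overhead
db_ratio = db_size_gb / (osd_size_tb * 1000)         # block-db vs. data size

print(f"OSDs: {osds}, block-db partitions per host: {db_partitions}")
print(f"PG shards per OSD: {shards_per_osd:.0f}")
print(f"raw: {raw_tb} TB, usable with k={k}, m={m}: {usable_tb:.0f} TB")
print(f"block-db is {db_ratio:.3%} of the data device size")
```

So each host has exactly one block-db partition per HDD, and the block-db is a bit under 0.2 % of the data device size.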

The interconnect (public and cluster network) is made via IP over InfiniBand (56 Gbit/s bandwidth), using the software stack that comes with CentOS 7.

This leaves us with the possibility that one of the metadata hosts can fail, and one of the remaining disks can still fail after that.

For the data hosts, up to two machines total can fail.

We have 40 clients connected to this cluster. We now run something like:

dd if=/dev/zero of=some_file bs=1M count=10000

on each CPU core of each of the clients, yielding a total of 1120 writing processes (all 40 clients have 28+28HT cores), using the ceph-fuse client.

This yields a write throughput of a bit below 1 GB/s (capital B), which is unexpectedly low.

Running BeeGFS on the same cluster before (disks were in RAID 6 in that case) yielded throughputs of about 12 GB/s, but came with other issues (e.g. it's not FOSS...), so we'd love to run Ceph :-).

I performed some basic tests to try to understand the bottleneck for Ceph:

# rados bench -p cephfs_data 10 write --no-cleanup -t 40

Bandwidth (MB/sec):     695.952

Stddev Bandwidth:       295.223

Max bandwidth (MB/sec): 1088

Min bandwidth (MB/sec): 76

Average IOPS:           173

Stddev IOPS:            73

Max IOPS:               272

Min IOPS:               19

Average Latency(s):     0.220967

Stddev Latency(s):      0.305967

Max latency(s):         2.88931

Min latency(s):         0.0741061

=> This agrees mostly with our basic dd benchmark.
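The rados bench figures above are internally consistent with the 4 MB default write size mentioned in the reply at the top. A small sketch of that check, using only numbers from the output; the comparison of the average IOPS against concurrency divided by latency (Little's law) is my own sanity check, not something rados bench reports.

```python
OBJECT_SIZE_MB = 4        # rados bench default write size, per the reply above
concurrency = 40          # from "-t 40"

avg_iops, max_iops, min_iops = 173, 272, 19
reported_avg_bw = 695.952
avg_latency_s = 0.220967

# Bandwidth should simply be IOPS times the object size.
max_bw = max_iops * OBJECT_SIZE_MB
min_bw = min_iops * OBJECT_SIZE_MB
avg_bw = avg_iops * OBJECT_SIZE_MB

# Little's law: throughput ~= operations in flight / average latency.
expected_iops = concurrency / avg_latency_s

print(f"max {max_bw} MB/s, min {min_bw} MB/s, avg ~{avg_bw} MB/s "
      f"(reported avg: {reported_avg_bw})")
print(f"IOPS expected from latency at {concurrency} in flight: {expected_iops:.0f}")
```

In other words, with 40 operations in flight the throughput is bounded by the per-operation latency of roughly 0.22 s, which may be worth keeping in mind when comparing against the dd runs.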

Reading is a bit faster:

# rados bench -p cephfs_data 10 rand

=> Bandwidth (MB/sec):   1108.75

However, the disks are reasonably quick:

# ceph tell osd.0 bench

{

    "bytes_written": 1073741824,

    "blocksize": 4194304,

    "bytes_per_sec": 331850403

}
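For readability, the raw byte counts in this output can be converted into more familiar units. A minimal sketch that parses the JSON shown above; only the three fields from the output are used, and the elapsed time is derived from them rather than reported by Ceph.

```python
import json

osd_bench_output = """
{
    "bytes_written": 1073741824,
    "blocksize": 4194304,
    "bytes_per_sec": 331850403
}
"""

MIB = 1024 ** 2

result = json.loads(osd_bench_output)
writes = result["bytes_written"] // result["blocksize"]        # number of 4 MiB writes
mib_per_sec = result["bytes_per_sec"] / MIB
elapsed_s = result["bytes_written"] / result["bytes_per_sec"]  # derived, not reported

print(f"{writes} writes of {result['blocksize'] // MIB} MiB each")
print(f"{mib_per_sec:.1f} MiB/s over about {elapsed_s:.2f} s")
```

Note that this is a single OSD writing 1 GiB locally, without replication, erasure coding, or network traffic, so it is not directly comparable to the pool-level numbers.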

I checked and the OSD-hosts peaked at a load average of about 22 (they have 24+24HT cores) in our dd benchmark, but stayed well below that (only about 20 % per OSD daemon) in the rados bench test.

One idea would be to switch the EC plugin from jerasure to ISA, since all the machines have Intel CPUs anyway.

Already tried:

- TCP stack tuning (wmem, rmem), no huge effect.

- changing the block sizes used by dd, no effect.

- Testing network throughput with ib_write_bw, this revealed something like:

 #bytes     #iterations    BW peak[MB/sec]    BW average[MB/sec]   MsgRate[Mpps]
 2          5000           19.73              19.30                10.118121
 4          5000           52.79              51.70                13.553412
 8          5000           101.23             96.65                12.668371
 16         5000           243.66             233.42               15.297583
 32         5000           350.66             344.73               11.296089
 64         5000           909.14             324.85               5.322323
 128        5000           1424.84            1401.29              11.479374
 256        5000           2865.24            2801.04              11.473055
 512        5000           5169.98            5095.08              10.434733
 1024       5000           10022.75           9791.42              10.026410
 2048       5000           10988.64           10628.83             5.441958
 4096       5000           11401.40           11399.14             2.918180

[...]

So it seems the IP-over-Infiniband is not the bottleneck (BeeGFS was using RDMA).
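When comparing the ib_write_bw table with the GB/s figures quoted elsewhere in this mail, it helps to know which "MB" the tool means. The sketch below checks a few rows from the table and suggests that the average bandwidth column equals message size times message rate when one MB is taken as 2^20 bytes. That interpretation is inferred from the numbers, not taken from the tool's documentation.

```python
# (message size in bytes, BW average [MB/sec], MsgRate [Mpps]) from the table above
rows = [
    (512, 5095.08, 10.434733),
    (1024, 9791.42, 10.026410),
    (2048, 10628.83, 5.441958),
    (4096, 11399.14, 2.918180),
]

MIB = 1024 ** 2


def bw_from_msg_rate(size_bytes: int, mpps: float) -> float:
    """Bandwidth in units of 2^20 bytes per second implied by the message rate."""
    return size_bytes * mpps * 1e6 / MIB


for size, reported, mpps in rows:
    derived = bw_from_msg_rate(size, mpps)
    print(f"{size:5d} B: reported {reported:9.2f}, derived {derived:9.2f}")
```

For these rows the derived and reported values agree to within rounding.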

Other ideas that come to mind:

- Testing with Ceph-RDMA, but that does not seem production-ready yet, if I read the list correctly.

- Increasing osd_pool_erasure_code_stripe_width.

- Using ISA as EC plugin.

- Reducing bluestore_cache_size_hdd: it seems swap is used when recovery and benchmark run at the same time (but not when only benchmarking, so this should not explain the slowdown).

However, since we are just beginning with Ceph, it may well be we are missing something basic, but crucial here.

For example, could it be that the block-db storage is too small? How can we find out?
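One possible way to check this, offered as a sketch rather than a definitive answer: BlueStore OSDs expose BlueFS counters through the admin socket (`ceph daemon osd.N perf dump`, run on the OSD host). Assuming the `bluefs` section of that dump contains `db_total_bytes`, `db_used_bytes` and `slow_used_bytes`, a non-zero `slow_used_bytes` would suggest that RocksDB data has spilled over from the block-db partition onto the HDD. The helper name and the sample values below are invented for illustration.

```python
import json


def bluefs_db_usage(perf_dump: dict) -> dict:
    """Summarize block-db usage from a parsed 'perf dump' (assumed field names)."""
    bluefs = perf_dump["bluefs"]
    total = bluefs["db_total_bytes"]
    used = bluefs["db_used_bytes"]
    spilled = bluefs["slow_used_bytes"]   # bytes BlueFS placed on the slow device
    return {
        "db_used_fraction": used / total,
        "spilled_bytes": spilled,
        "spilled": spilled > 0,
    }


# Made-up sample standing in for: ceph daemon osd.0 perf dump
sample = json.loads("""
{"bluefs": {"db_total_bytes": 7000000000,
            "db_used_bytes": 3500000000,
            "slow_used_bytes": 0}}
""")

usage = bluefs_db_usage(sample)
print(f"block-db used: {usage['db_used_fraction']:.0%}, "
      f"spilled to HDD: {usage['spilled_bytes']} bytes")
```

Running this against every OSD after a benchmark would show whether any block-db partition has filled up.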

Do any ideas come to mind?

A second, hopefully easier question:

If one OSD-host fails in our setup, all PGs are changed to "active+clean+remapped" and lots of data is moved.

I understand the remapping is needed, but why is data actually moved? With k=4 and m=2, failure domain=host, and 6 hosts of which one is down, there should be no redundancy gained by moving data around after one host has gone down. Or am I missing something here?
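The arithmetic behind this question can be written down as a small sketch. It only models the constraint that each of the k+m shards must sit on a different host; it does not model how CRUSH actually chooses or remaps OSDs, so it does not by itself explain the observed data movement.

```python
k, m = 4, 2
hosts_total = 6
hosts_down = 1

shards = k + m                                # shards per PG
hosts_up = hosts_total - hosts_down

# With crush-failure-domain=host, at most one shard of a PG per host.
placeable_shards = min(shards, hosts_up)
missing_shards = shards - placeable_shards
can_restore_full_redundancy = hosts_up >= shards
further_failures_tolerated = m - missing_shards

print(f"{placeable_shards} of {shards} shards can be placed on {hosts_up} hosts")
print(f"full redundancy restorable: {can_restore_full_redundancy}")
print(f"further host failures tolerated without data loss: {further_failures_tolerated}")
```

Under this simplified model, five hosts can hold at most five of the six shards, so moving data cannot bring a PG back to full redundancy until the sixth host returns.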

Cheers and many thanks in advance,

        Oliver

_______________________________________________

ceph-users mailing list

ceph-users@xxxxxxxxxxxxxx

http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
