Re: Squeezing Performance of CEPH

Ashley Merrick <ashley@xxxxxxxxxxxxxx> · Thu, 22 Jun 2017 17:47:20 +0000

Hello,

Also, as Mark put it: one minute you're testing bandwidth capacity, the next minute you're testing disk capacity.

No way is a small set of SSDs going to be able to max out your current bandwidth, even if you removed the Ceph / journal overhead. I would say the speeds you are getting are what you should expect, as seen with many other setups.

Ashley


On 23 Jun 2017, at 12:42 AM, Mark Nelson <mnelson@xxxxxxxxxx> wrote:

Hello Massimiliano,

Based on the configuration below, it appears you have 8 SSDs total (2 nodes with 4 SSDs each)?

I'm going to assume you have 3x replication and are using filestore, so in reality you are writing 3 copies and doing full data journaling for each copy, for 6x writes per client write.  Taking this into account, your per-SSD throughput should be somewhere around:

Sequential write:

~600 * 3 (copies) * 2 (journal write per copy) / 8 (ssds) = ~450MB/s

Sequential read

~3000 / 8 (ssds) = ~375MB/s

Random read

~3337 / 8 (ssds) = ~417MB/s
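
The arithmetic above can be reproduced with a short Python sketch. The constants are the assumptions stated in this message (3x replication, filestore with the journal on the same SSD as the data, 8 OSD SSDs in total), not values confirmed by the original poster.

```python
# Per-SSD load implied by the rados bench results, under assumed settings.
REPLICAS = 3        # assumed pool size (3x replication)
JOURNAL_FACTOR = 2  # assumed filestore: one journal write plus one data write per copy
NUM_SSDS = 8        # 2 nodes x 4 OSD SSDs


def per_ssd_write(client_mb_s):
    """Write load per SSD for a given client write bandwidth."""
    return client_mb_s * REPLICAS * JOURNAL_FACTOR / NUM_SSDS


def per_ssd_read(client_mb_s):
    """Read load per SSD; reads are served from a single copy."""
    return client_mb_s / NUM_SSDS


print(per_ssd_write(600))   # sequential write
print(per_ssd_read(3000))   # sequential read
print(per_ssd_read(3337))   # random read
```

If the pool actually uses 2x replication, or the journals live on separate devices, change the constants accordingly and the write figure drops.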

These numbers are pretty reasonable for SATA based SSDs, though the read throughput is a little low.  You didn't include the model of SSD, but if you look at Intel's DC S3700 which is a fairly popular SSD for ceph:

https://www.intel.com/content/www/us/en/solid-state-drives/ssd-dc-s3700-spec.html

Sequential read is up to ~500MB/s and sequential write speeds up to 460MB/s.  Not too far off from what you are seeing.  You might try playing with readahead on the OSD devices to see if that improves things at all.  Still, unless I've missed something, these numbers aren't terrible.
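
One way to inspect or change readahead on Linux is the per-device sysfs file `queue/read_ahead_kb`. The helper names below are hypothetical, and the device name `sdb` and the value 4096 in the usage comment are placeholders rather than recommendations from this thread.

```python
from pathlib import Path


def readahead_file(device, sysfs_root="/sys/block"):
    """Path of the readahead setting for a block device such as 'sdb'."""
    return Path(sysfs_root) / device / "queue" / "read_ahead_kb"


def get_readahead_kb(device, sysfs_root="/sys/block"):
    """Return the current readahead size in KB."""
    return int(readahead_file(device, sysfs_root).read_text().strip())


def set_readahead_kb(device, kb, sysfs_root="/sys/block"):
    """Set the readahead size in KB (needs root on a real system)."""
    readahead_file(device, sysfs_root).write_text(f"{kb}\n")


# Example usage on an OSD node (placeholder device and value):
#   print(get_readahead_kb("sdb"))
#   set_readahead_kb("sdb", 4096)
```

The setting does not survive a reboot, so it is convenient for a quick before/after comparison with `rados bench ... seq`.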

Mark

On 06/22/2017 12:19 PM, Massimiliano Cuttini wrote:

Hi everybody,

I want to squeeze all the performance out of Ceph (we are using Jewel 10.2.7).

We are testing in a test environment with 2 nodes having the same configuration:

 * CentOS 7.3
 * 24 CPUs (12 physical cores with hyper-threading)
 * 32 GB of RAM
 * 2x 100Gbit/s ethernet cards
 * 2x SSDs in RAID dedicated to the OS
 * 4x OSD SSDs, SATA 6Gbit/s

We are already expecting the following bottlenecks:

 * [ SATA speed x number of disks ] = 24Gbit/s
 * [ Network speed x number of bonded cards ] = 200Gbit/s

So the minimum of the two is 24Gbit/s per node (not taking protocol overhead into account).

24Gbit/s per node x 2 = 48Gbit/s of maximum hypothetical gross speed.
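
The bottleneck estimate can be written out as a small Python sketch using the hardware figures listed above. The 8b/10b adjustment is an addition here, not part of the original estimate: SATA's line encoding leaves roughly 80% of the 6Gbit/s line rate for payload.

```python
# Hardware figures from the node configuration above.
SATA_GBIT = 6           # line rate per OSD disk
OSD_DISKS_PER_NODE = 4
NIC_GBIT = 100
NICS_PER_NODE = 2
NODES = 2

disk_limit = SATA_GBIT * OSD_DISKS_PER_NODE  # Gbit/s per node from the disks
net_limit = NIC_GBIT * NICS_PER_NODE         # Gbit/s per node from the bond
per_node = min(disk_limit, net_limit)        # the tighter of the two limits
cluster = per_node * NODES                   # gross figure for both nodes

# Assumption added here: 8b/10b encoding on SATA carries 8 payload bits per 10 line bits.
cluster_payload = cluster * 8 / 10

print(per_node, cluster, cluster_payload)
```

This is still an upper bound on raw disk interface speed only; it says nothing about what the SSDs can sustain or about replication and journaling overhead.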

Here are the tests:

///////IPERF2/////// Tests are quite good, scoring 88% of the bottleneck.

Note: iperf2 can use only 1 connection from a bond (it's a well-known issue).

   [ ID] Interval       Transfer     Bandwidth
   [ 12]  0.0-10.0 sec  9.55 GBytes  8.21 Gbits/sec
   [  3]  0.0-10.0 sec  10.3 GBytes  8.81 Gbits/sec
   [  5]  0.0-10.0 sec  9.54 GBytes  8.19 Gbits/sec
   [  7]  0.0-10.0 sec  9.52 GBytes  8.18 Gbits/sec
   [  6]  0.0-10.0 sec  9.96 GBytes  8.56 Gbits/sec
   [  8]  0.0-10.0 sec  12.1 GBytes  10.4 Gbits/sec
   [  9]  0.0-10.0 sec  12.3 GBytes  10.6 Gbits/sec
   [ 10]  0.0-10.0 sec  10.2 GBytes  8.80 Gbits/sec
   [ 11]  0.0-10.0 sec  9.34 GBytes  8.02 Gbits/sec
   [  4]  0.0-10.0 sec  10.3 GBytes  8.82 Gbits/sec
   [SUM]  0.0-10.0 sec   103 GBytes  88.6 Gbits/sec
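
As a quick sanity check, the per-stream figures can be summed and compared against a single 100Gbit/s link (since only one connection of the bond is used). This Python sketch only re-adds the numbers printed above.

```python
# Per-stream bandwidth in Gbit/s, copied from the iperf2 output above.
streams_gbit = [8.21, 8.81, 8.19, 8.18, 8.56, 10.4, 10.6, 8.80, 8.02, 8.82]

LINK_GBIT = 100  # one card of the bond, per the note above

total = sum(streams_gbit)
share = total / LINK_GBIT

print(f"total: {total:.2f} Gbit/s, share of one link: {share:.1%}")
```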

///////RADOS BENCH

Taking into consideration the maximum hypothetical speed (mhs) of 48Gbit/s (due to the disk bottleneck), the test results are not good enough.

 * Average write bandwidth is about 5-7Gbit/s (12.5% of the mhs)
 * Average sequential read bandwidth is almost 24Gbit/s (50% of the mhs)
 * Average random read bandwidth is almost 27Gbit/s (56.25% of the mhs)

Here are the reports.

Write:

   # rados bench -p scbench 10 write --no-cleanup
   Total time run:         10.229369
   Total writes made:      1538
   Write size:             4194304
   Object size:            4194304
   Bandwidth (MB/sec):     601.406
   Stddev Bandwidth:       357.012
   Max bandwidth (MB/sec): 1080
   Min bandwidth (MB/sec): 204
   Average IOPS:           150
   Stddev IOPS:            89
   Max IOPS:               270
   Min IOPS:               51
   Average Latency(s):     0.106218
   Stddev Latency(s):      0.198735
   Max latency(s):         1.87401
   Min latency(s):         0.0225438

Sequential read:

   # rados bench -p scbench 10 seq
   Total time run:       2.054359
   Total reads made:     1538
   Read size:            4194304
   Object size:          4194304
   Bandwidth (MB/sec):   2994.61
   Average IOPS          748
   Stddev IOPS:          67
   Max IOPS:             802
   Min IOPS:             707
   Average Latency(s):   0.0202177
   Max latency(s):       0.223319
   Min latency(s):       0.00589238

Random read:

   # rados bench -p scbench 10 rand
   Total time run:       10.036816
   Total reads made:     8375
   Read size:            4194304
   Object size:          4194304
   Bandwidth (MB/sec):   3337.71
   Average IOPS:         834
   Stddev IOPS:          78
   Max IOPS:             927
   Min IOPS:             741
   Average Latency(s):   0.0182707
   Max latency(s):       0.257397
   Min latency(s):       0.00469212
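
The three reports can be cross-checked from their own totals. The Python sketch below recomputes bandwidth as operations x object size / run time and converts it to Gbit/s. The reported MB/sec values only match when MB is taken as 2^20 bytes, which is an inference from these numbers rather than something stated in the thread. The 48Gbit/s reference is the hypothetical maximum from earlier in this message.

```python
MIB = 1024 * 1024
MHS_GBIT = 48  # maximum hypothetical speed from the estimate above

# (operations, object size in bytes, total run time in seconds) from the reports.
runs = {
    "write": (1538, 4194304, 10.229369),
    "seq read": (1538, 4194304, 2.054359),
    "rand read": (8375, 4194304, 10.036816),
}


def bandwidth(ops, size_bytes, seconds):
    """Return (MiB/s, Gbit/s) for a run."""
    bytes_per_s = ops * size_bytes / seconds
    return bytes_per_s / MIB, bytes_per_s * 8 / 1e9


for name, (ops, size, secs) in runs.items():
    mib_s, gbit_s = bandwidth(ops, size, secs)
    print(f"{name}: {mib_s:.2f} MB/s, {gbit_s:.2f} Gbit/s, "
          f"{gbit_s / MHS_GBIT:.1%} of mhs")
```

Note that the sequential read run finished in about 2 seconds because it only had the 1538 objects from the write run to read, so that figure rests on a much shorter sample than the other two.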

//------------------------------------

It seems that there is some bottleneck somewhere that we are underestimating.

Can you help me find it?

_______________________________________________

ceph-users mailing list

ceph-users@xxxxxxxxxxxxxx

http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
