Re: rados block on SSD - performance - how to tune and get insight?


 



This seems right. You are doing a single benchmark from a single client, so your limiting factor will be the network latency - for most networks this is between 0.2 and 0.3 ms. If you're trying to test the potential of your cluster, you'll need multiple workers and clients.
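
A rough sketch of that (pool name, thread count, and runtime below are just examples, not a recommendation): run something like this in parallel from several client hosts, so the aggregate concurrency is much higher than a single 16-way bench:

# run concurrently from each of several client hosts; -t raises the number of
# in-flight ops per client, --run-name keeps each client's objects separate
$ sudo rados bench -p scbench -b 4096 -t 64 --run-name $(hostname) 30 write --no-cleanup
$ sudo rados bench -p scbench -t 64 --run-name $(hostname) 30 rand

Summing the per-client IOPS then gives a better picture of what the cluster as a whole can deliver.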

On Thu, Feb 7, 2019, 2:17 AM <jesper@xxxxxxxx> wrote:
Hi List

We are in the process of moving to the next use case for our Ceph cluster.
(Bulk, cheap, slow, erasure-coded CephFS storage was the first - and that
works fine.)

We're currently on Luminous / BlueStore; if upgrading is likely to change
what we're seeing, please let us know.

We have 6 OSD hosts, each with a single 1TB Intel S4510 SSD, connected
through an H700 MegaRAID PERC with BBWC, each disk as a single-disk RAID0,
scheduler set to deadline, nomerges = 1, rotational = 0.
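
(For anyone wanting to reproduce the setup, I believe those settings map to
the usual sysfs knobs - /dev/sdb is just an example device name:)

$ echo deadline | sudo tee /sys/block/sdb/queue/scheduler
$ echo 1 | sudo tee /sys/block/sdb/queue/nomerges
$ echo 0 | sudo tee /sys/block/sdb/queue/rotational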

Each disk "should" give approximately 36K IOPS random write and the double
random read.

The pool is set up with 3x replication. We would like a scale-out setup of
well-performing SSD block devices - potentially to host databases and
things like that. I read through this nice document [0]; I know the
HW is radically different from mine, but I still think I'm at the
very low end of what 6 x S4510 should be capable of doing.
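
(For completeness, the pool was created roughly along these lines - the PG
count of 128 is only an example, not necessarily what we used:)

$ sudo ceph osd pool create scbench 128 128 replicated
$ sudo ceph osd pool set scbench size 3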

Since it is IOPS I care about, I have lowered the block size to 4096 - a 4M
block size nicely saturates the NICs in both directions.


$ sudo rados bench -p scbench -b 4096 10 write --no-cleanup
hints = 1
Maintaining 16 concurrent writes of 4096 bytes to objects of size 4096 for
up to 10 seconds or 0 objects
Object prefix: benchmark_data_torsk2_11207
  sec Cur ops   started  finished  avg MB/s  cur MB/s last lat(s)  avg lat(s)
    0       0         0         0         0         0           -           0
    1      16      5857      5841   22.8155   22.8164  0.00238437  0.00273434
    2      15     11768     11753   22.9533   23.0938   0.0028559  0.00271944
    3      16     17264     17248   22.4564   21.4648  0.00246666  0.00278101
    4      16     22857     22841   22.3037   21.8477    0.002716  0.00280023
    5      16     28462     28446   22.2213   21.8945  0.00220186    0.002811
    6      16     34216     34200   22.2635   22.4766  0.00234315  0.00280552
    7      16     39616     39600   22.0962   21.0938  0.00290661  0.00282718
    8      16     45510     45494   22.2118   23.0234   0.0033541  0.00281253
    9      16     50995     50979   22.1243   21.4258  0.00267282  0.00282371
   10      16     56745     56729   22.1577   22.4609  0.00252583   0.0028193
Total time run:         10.002668
Total writes made:      56745
Write size:             4096
Object size:            4096
Bandwidth (MB/sec):     22.1601
Stddev Bandwidth:       0.712297
Max bandwidth (MB/sec): 23.0938
Min bandwidth (MB/sec): 21.0938
Average IOPS:           5672
Stddev IOPS:            182
Max IOPS:               5912
Min IOPS:               5400
Average Latency(s):     0.00281953
Stddev Latency(s):      0.00190771
Max latency(s):         0.0834767
Min latency(s):         0.00120945

Min latency is fine - but a max latency of 83ms?
Average IOPS of 5672?

$ sudo rados bench -p scbench  10 rand
hints = 1
  sec Cur ops   started  finished  avg MB/s  cur MB/s last lat(s)  avg lat(s)
    0       0         0         0         0         0           -           0
    1      15     23329     23314   91.0537   91.0703 0.000349856 0.000679074
    2      16     48555     48539   94.7884   98.5352 0.000499159 0.000652067
    3      16     76193     76177   99.1747   107.961 0.000443877 0.000622775
    4      15    103923    103908   101.459   108.324 0.000678589 0.000609182
    5      15    132720    132705   103.663   112.488 0.000741734 0.000595998
    6      15    161811    161796   105.323   113.637 0.000333166 0.000586323
    7      15    190196    190181   106.115   110.879 0.000612227 0.000582014
    8      15    221155    221140   107.966   120.934 0.000471219 0.000571944
    9      16    251143    251127   108.984   117.137 0.000267528 0.000566659
Total time run:       10.000640
Total reads made:     282097
Read size:            4096
Object size:          4096
Bandwidth (MB/sec):   110.187
Average IOPS:         28207
Stddev IOPS:          2357
Max IOPS:             30959
Min IOPS:             23314
Average Latency(s):   0.000560402
Max latency(s):       0.109804
Min latency(s):       0.000212671

This is also quite far from expected. I have 12GB of memory on the OSD
daemon for caching on each host, and the cluster is close to idle - thus
50GB+ of cache for a working set of < 6GB. In this case it should not
really be bound by the underlying SSD. But if it were:

IOPS/disk * num disks / replication => 95K * 6 / 3 => 190K or 6x off?
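
(The cache sizing mentioned above is, I assume, governed by the per-OSD
BlueStore cache setting in ceph.conf on Luminous - something like:)

[osd]
# assumed knob for SSD-backed OSDs; 12 GiB per OSD daemon
bluestore_cache_size_ssd = 12884901888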

There is no measurable service time in iostat when running the tests, so I
have come to the conclusion that it has to be either the client side, the
network path, or the OSD daemon that delivers the increasing latency /
decreased IOPS.

Are there any suggestions on how to get more insight into that?
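
(I assume the OSD admin socket would be one place to look - per-OSD latency
counters and the slowest recent ops, e.g.:)

# on an OSD host; osd.0 is just an example id
$ sudo ceph daemon osd.0 perf dump | grep -A 3 op_w_latency
$ sudo ceph daemon osd.0 dump_historic_ops
$ sudo ceph osd perf    # commit/apply latency for all OSDs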

Has anyone gotten close to the numbers Micron are reporting on NVMe?

Thanks a lot.

[0]
https://www.micron.com/-/media/client/global/documents/products/other-documents/micron_9200_max_ceph_12,-d-,2,-d-,8_luminous_bluestore_reference_architecture.pdf?la=en

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
