On 07/02/2019 09:17, jesper@xxxxxxxx wrote:
Hi List

We are in the process of moving to the next use case for our Ceph cluster. Bulk, cheap, slow, erasure-coded CephFS storage was the first, and that works fine. We're currently on Luminous / BlueStore; if upgrading is likely to change what we're seeing, please let us know.

We have 6 OSD hosts, each with a single 1 TB S4510 SSD. They are connected through an H700 MegaRAID PERC with BBWC, each disk as a single-drive RAID0, scheduler set to deadline, nomerges = 1, rotational = 0. Each disk "should" give approximately 36K IOPS random write and double that for random read. The pool is set up with 3x replication.

We would like a "scale-out" setup of well-performing SSD block devices, potentially to host databases and things like that. I read through this nice document [0]; I know the HW is radically different from mine, but I still think I'm at the very low end of what 6 x S4510 should be capable of.

Since it is IOPS I care about, I have lowered the block size to 4096 -- a 4M block size nicely saturates the NICs in both directions.

$ sudo rados bench -p scbench -b 4096 10 write --no-cleanup
hints = 1
Maintaining 16 concurrent writes of 4096 bytes to objects of size 4096 for up to 10 seconds or 0 objects
Object prefix: benchmark_data_torsk2_11207
  sec Cur ops   started  finished  avg MB/s  cur MB/s  last lat(s)  avg lat(s)
    0       0         0         0         0         0            -           0
    1      16      5857      5841   22.8155   22.8164   0.00238437  0.00273434
    2      15     11768     11753   22.9533   23.0938    0.0028559  0.00271944
    3      16     17264     17248   22.4564   21.4648   0.00246666  0.00278101
    4      16     22857     22841   22.3037   21.8477     0.002716  0.00280023
    5      16     28462     28446   22.2213   21.8945   0.00220186    0.002811
    6      16     34216     34200   22.2635   22.4766   0.00234315  0.00280552
    7      16     39616     39600   22.0962   21.0938   0.00290661  0.00282718
    8      16     45510     45494   22.2118   23.0234    0.0033541  0.00281253
    9      16     50995     50979   22.1243   21.4258   0.00267282  0.00282371
   10      16     56745     56729   22.1577   22.4609   0.00252583   0.0028193
Total time run:         10.002668
Total writes made:      56745
Write size:             4096
Object size:            4096
Bandwidth (MB/sec):     22.1601
Stddev Bandwidth:       0.712297
Max bandwidth (MB/sec): 23.0938
Min bandwidth (MB/sec): 21.0938
Average IOPS:           5672
Stddev IOPS:            182
Max IOPS:               5912
Min IOPS:               5400
Average Latency(s):     0.00281953
Stddev Latency(s):      0.00190771
Max latency(s):         0.0834767
Min latency(s):         0.00120945

Min latency is fine -- but a max latency of 83 ms? Average IOPS at 5672?

$ sudo rados bench -p scbench 10 rand
hints = 1
  sec Cur ops   started  finished  avg MB/s  cur MB/s  last lat(s)  avg lat(s)
    0       0         0         0         0         0            -           0
    1      15     23329     23314   91.0537   91.0703  0.000349856 0.000679074
    2      16     48555     48539   94.7884   98.5352  0.000499159 0.000652067
    3      16     76193     76177   99.1747   107.961  0.000443877 0.000622775
    4      15    103923    103908   101.459   108.324  0.000678589 0.000609182
    5      15    132720    132705   103.663   112.488  0.000741734 0.000595998
    6      15    161811    161796   105.323   113.637  0.000333166 0.000586323
    7      15    190196    190181   106.115   110.879  0.000612227 0.000582014
    8      15    221155    221140   107.966   120.934  0.000471219 0.000571944
    9      16    251143    251127   108.984   117.137  0.000267528 0.000566659
Total time run:       10.000640
Total reads made:     282097
Read size:            4096
Object size:          4096
Bandwidth (MB/sec):   110.187
Average IOPS:         28207
Stddev IOPS:          2357
Max IOPS:             30959
Min IOPS:             23314
Average Latency(s):   0.000560402
Max latency(s):       0.109804
Min latency(s):       0.000212671

This is also quite far from expected. I have 12GB of memory on the OSD daemon for caching on each host, and the cluster is close to idle, thus 50GB+ for caching against a working set of < 6GB. In this case the test should not really be bound by the underlying SSDs.
But if it were: IOPS/disk * number of disks / replication => 95K * 6 / 3 => 190K, or 6x off?

There is no measurable service time in iostat when running the tests, so I have come to the conclusion that it has to be either the client side, the network path, or the OSD daemon that delivers the increasing latency / decreased IOPS. Are there any suggestions on how to get more insight into that? Has anyone come close to the numbers Micron are reporting on NVMe?

Thanks a lot.

[0] https://www.micron.com/-/media/client/global/documents/products/other-documents/micron_9200_max_ceph_12,-d-,2,-d-,8_luminous_bluestore_reference_architecture.pdf?la=en
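One way to get that insight at the OSD level is the admin socket, which exposes per-op latency counters and the slowest recent ops. A minimal sketch, not from the thread, assuming osd.0 is an OSD id on the host you are logged in to:

$ sudo ceph osd perf                          # per-OSD commit/apply latency summary
$ sudo ceph daemon osd.0 perf dump            # JSON counters; check op_w_latency / op_r_latency
$ sudo ceph daemon osd.0 dump_historic_ops    # slowest recent ops with per-stage timestamps

If the historic ops show most of their time in stages such as "queued_for_pg" or waiting for subops from the replica OSDs, the outliers are spent in queueing or replication rather than on the device itself.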
6k is low as a max write IOPS value, even for a single client. For a cluster of 3 nodes, we see from 10k to 60k write IOPS depending on hardware.
Can you increase your threads to 64 or 128 via the -t parameter? Can you run fio with sync=1 on your disks? Can you try with the noop scheduler? What is the %utilization on the disks and CPU? Can you have more than 1 disk per node? (A rough sketch of these checks is below.)

/Maged
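A rough sketch of those checks, assuming the pool name from the test above; /dev/sdX and the fio job parameters are illustrative, and writing fio directly to a device is destructive, so point it at a spare disk or a test file:

$ sudo rados bench -p scbench -b 4096 -t 64 10 write --no-cleanup    # higher client concurrency

$ sudo fio --name=synctest --filename=/dev/sdX --rw=randwrite --bs=4k \
       --ioengine=libaio --direct=1 --sync=1 --iodepth=1 --numjobs=1 \
       --runtime=60 --time_based                                     # raw sync-write IOPS of one SSD

$ echo noop | sudo tee /sys/block/sdX/queue/scheduler                # switch to the noop scheduler

$ iostat -x 1                                                        # %util per disk during the benchmark
$ top                                                                # CPU use of the ceph-osd processes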