I ran your rados bench test on our SM863a pool (3x replication) and got similar results.

[@]# rados bench -p fs_data.ssd -b 4096 10 write --no-cleanup
hints = 1
Maintaining 16 concurrent writes of 4096 bytes to objects of size 4096 for up to 10 seconds or 0 objects
Object prefix: benchmark_data_c04_1337712
  sec Cur ops   started  finished  avg MB/s  cur MB/s  last lat(s)  avg lat(s)
    0       0         0         0         0         0            -           0
    1      16      6302      6286   24.5533   24.5547   0.00304773    0.002541
    2      15     12545     12530   24.4705   24.3906   0.00228294   0.0025506
    3      16     18675     18659   24.2933   23.9414   0.00332918  0.00257042
    4      16     25194     25178   24.5854   25.4648    0.0034176  0.00254016
    5      16     31657     31641   24.7169   25.2461   0.00156494  0.00252686
    6      16     37713     37697   24.5398   23.6562   0.00228134  0.00254527
    7      16     43848     43832   24.4572   23.9648   0.00238393  0.00255401
    8      16     49516     49500   24.1673   22.1406   0.00244473  0.00258466
    9      16     55562     55546   24.1059   23.6172   0.00249619  0.00259139
   10      16     61675     61659   24.0829   23.8789    0.0020192  0.00259362
Total time run:         10.002179
Total writes made:      61675
Write size:             4096
Object size:            4096
Bandwidth (MB/sec):     24.0865
Stddev Bandwidth:       0.932554
Max bandwidth (MB/sec): 25.4648
Min bandwidth (MB/sec): 22.1406
Average IOPS:           6166
Stddev IOPS:            238
Max IOPS:               6519
Min IOPS:               5668
Average Latency(s):     0.00259383
Stddev Latency(s):      0.00173856
Max latency(s):         0.0778051
Min latency(s):         0.00110931

[@]# rados bench -p fs_data.ssd 10 rand
hints = 1
  sec Cur ops   started  finished  avg MB/s  cur MB/s  last lat(s)  avg lat(s)
    0       0         0         0         0         0            -           0
    1      15     27697     27682   108.115   108.133  0.000755936 0.000568212
    2      15     57975     57960   113.186   118.273  0.000547682 0.000542773
    3      15     88500     88485   115.199   119.238   0.00036749 0.000533185
    4      15    117199    117184   114.422   112.105  0.000354388 0.000536647
    5      15    147734    147719    115.39   119.277  0.000419781  0.00053221
    6      16    176393    176377   114.814   111.945  0.000427109 0.000534771
    7      15    203693    203678   113.645   106.645  0.000379089 0.000540113
    8      15    231917    231902   113.219    110.25  0.000465232 0.000542156
    9      16    261054    261038   113.284   113.812  0.000358025 0.000541972
Total time run:       10.000669
Total reads made:     290371
Read size:            4096
Object size:          4096
Bandwidth (MB/sec):   113.419
Average IOPS:         29035
Stddev IOPS:          1212
Max IOPS:             30535
Min IOPS:             27301
Average Latency(s):   0.000541371
Max latency(s):       0.00380609
Min latency(s):       0.000155521

-----Original Message-----
From: jesper@xxxxxxxx [mailto:jesper@xxxxxxxx]
Sent: 07 February 2019 08:17
To: ceph-users@xxxxxxxxxxxxxx
Subject: rados block on SSD - performance - how to tune and get insight?

Hi List

We are in the process of moving to the next use case for our Ceph cluster; bulk, cheap, slow, erasure-coded CephFS storage was the first, and that works fine. We're currently on Luminous / BlueStore; if upgrading is likely to change what we're seeing, please let us know.

We have 6 OSD hosts, each with one 1TB S4510 SSD, connected through an H700 MegaRAID PERC with BBWC, configured as EachDiskRaid0, with the scheduler set to deadline, nomerges = 1 and rotational = 0. Each disk "should" deliver approximately 36K random-write IOPS and double that for random reads. The pool is set up with 3x replication.

We would like a "scale-out" setup of well-performing SSD block devices, potentially to host databases and the like. I read through this nice document [0]; I know the HW is radically different from mine, but I still think I'm at the very low end of what 6 x S4510 should be capable of.

Since it is IOPS I care about, I have lowered the block size to 4096 -- a 4M block size nicely saturates the NICs in both directions.
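
The per-device tuning mentioned above is the usual sysfs knobs, applied roughly like this (a sketch; sdX is a placeholder for the virtual disk exposed by the controller):

echo deadline > /sys/block/sdX/queue/scheduler
echo 1 > /sys/block/sdX/queue/nomerges
echo 0 > /sys/block/sdX/queue/rotational
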
$ sudo rados bench -p scbench -b 4096 10 write --no-cleanup
hints = 1
Maintaining 16 concurrent writes of 4096 bytes to objects of size 4096 for up to 10 seconds or 0 objects
Object prefix: benchmark_data_torsk2_11207
  sec Cur ops   started  finished  avg MB/s  cur MB/s  last lat(s)  avg lat(s)
    0       0         0         0         0         0            -           0
    1      16      5857      5841   22.8155   22.8164   0.00238437  0.00273434
    2      15     11768     11753   22.9533   23.0938    0.0028559  0.00271944
    3      16     17264     17248   22.4564   21.4648   0.00246666  0.00278101
    4      16     22857     22841   22.3037   21.8477     0.002716  0.00280023
    5      16     28462     28446   22.2213   21.8945   0.00220186    0.002811
    6      16     34216     34200   22.2635   22.4766   0.00234315  0.00280552
    7      16     39616     39600   22.0962   21.0938   0.00290661  0.00282718
    8      16     45510     45494   22.2118   23.0234    0.0033541  0.00281253
    9      16     50995     50979   22.1243   21.4258   0.00267282  0.00282371
   10      16     56745     56729   22.1577   22.4609   0.00252583   0.0028193
Total time run:         10.002668
Total writes made:      56745
Write size:             4096
Object size:            4096
Bandwidth (MB/sec):     22.1601
Stddev Bandwidth:       0.712297
Max bandwidth (MB/sec): 23.0938
Min bandwidth (MB/sec): 21.0938
Average IOPS:           5672
Stddev IOPS:            182
Max IOPS:               5912
Min IOPS:               5400
Average Latency(s):     0.00281953
Stddev Latency(s):      0.00190771
Max latency(s):         0.0834767
Min latency(s):         0.00120945

The min latency is fine -- but a max latency of 83 ms? And an average of only 5672 IOPS?

$ sudo rados bench -p scbench 10 rand
hints = 1
  sec Cur ops   started  finished  avg MB/s  cur MB/s  last lat(s)  avg lat(s)
    0       0         0         0         0         0            -           0
    1      15     23329     23314   91.0537   91.0703  0.000349856 0.000679074
    2      16     48555     48539   94.7884   98.5352  0.000499159 0.000652067
    3      16     76193     76177   99.1747   107.961  0.000443877 0.000622775
    4      15    103923    103908   101.459   108.324  0.000678589 0.000609182
    5      15    132720    132705   103.663   112.488  0.000741734 0.000595998
    6      15    161811    161796   105.323   113.637  0.000333166 0.000586323
    7      15    190196    190181   106.115   110.879  0.000612227 0.000582014
    8      15    221155    221140   107.966   120.934  0.000471219 0.000571944
    9      16    251143    251127   108.984   117.137  0.000267528 0.000566659
Total time run:       10.000640
Total reads made:     282097
Read size:            4096
Object size:          4096
Bandwidth (MB/sec):   110.187
Average IOPS:         28207
Stddev IOPS:          2357
Max IOPS:             30959
Min IOPS:             23314
Average Latency(s):   0.000560402
Max latency(s):       0.109804
Min latency(s):       0.000212671

This is also quite far from what I expected. I have 12GB of memory for the OSD daemon on each host for caching, and the cluster is close to idle, so that is 50GB+ of cache for a working set of < 6GB. In this case the results should not really be bound by the underlying SSDs at all. But even if they were: IOPS per disk * number of disks / replication => 95K * 6 / 3 => 190K, so roughly 6x off?

There is no measurable service time in iostat when running the tests, so I have come to the conclusion that it has to be either the client side, the network path, or the OSD daemon that is adding the latency and limiting the IOPS. Are there any suggestions on how to get more insight into that?

Has anyone gotten close to the numbers Micron is reporting for NVMe?

Thanks a lot.

[0] https://www.micron.com/-/media/client/global/documents/products/other-documents/micron_9200_max_ceph_12,-d-,2,-d-,8_luminous_bluestore_reference_architecture.pdf?la=en
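
One way to start narrowing down where the latency is added - a sketch, assuming admin access to the OSD hosts and the default admin socket location; osd.0 is just an example id - is to look at the per-OSD latency counters and the slowest recent ops:

ceph osd perf                         # commit/apply latency per OSD as seen by the cluster
ceph daemon osd.0 perf dump           # detailed counters for one OSD, including op latencies (run on its host)
ceph daemon osd.0 dump_historic_ops   # slowest recent ops on that OSD, with per-event timestamps

Re-running rados bench with higher client concurrency, e.g. -t 64 instead of the default 16, can also help separate per-op latency from queue-depth limits on the client side.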