Hello,

On Thu, 7 Feb 2019 08:17:20 +0100 jesper@xxxxxxxx wrote:

> Hi List
>
> We are in the process of moving to the next use case for our Ceph
> cluster (bulk, cheap, slow, erasure-coded CephFS storage was the
> first, and that works fine).
>
> We're currently on Luminous / BlueStore; if upgrading is deemed likely
> to change what we're seeing, then please let us know.
>
> We have 6 OSD hosts, each with one 1TB S4510 SSD, connected through an
> H700 MegaRAID PERC with BBWC, each disk as a single-disk RAID0, and
> the scheduler set to deadline, nomerges = 1, rotational = 0.

I'd make sure that the endurance of these SSDs is in line with your
expected usage.

> Each disk "should" give approximately 36K IOPS random write and double
> that for random read.

Only locally; latency is your enemy. Tell us more about your network
(the P.P.S. at the bottom has a quick way to put a number on this).

> The pool is set up with 3x replication. We would like a "scale-out"
> setup of well-performing SSD block devices, potentially to host
> databases and things like that. I read through this nice document [0];
> I know the HW is radically different from mine, but I still think I'm
> at the very low end of what 6 x S4510 should be capable of doing.
>
> Since it is IOPS I care about, I have lowered the block size to 4096 --
> a 4M block size nicely saturates the NICs in both directions.

rados bench is not the sharpest tool in the shed for this, as it needs
to allocate objects to begin with, amongst other things. And before you
go "fio with the RBD engine", that had major issues in my experience,
too. Your best and most realistic results will come from doing the
testing inside a VM (I presume, from your use case) or against a
mounted RBD block device -- and then using fio, of course (a minimal
sketch of such a run follows after the bench output below).

> $ sudo rados bench -p scbench -b 4096 10 write --no-cleanup
> hints = 1
> Maintaining 16 concurrent writes of 4096 bytes to objects of size 4096
> for up to 10 seconds or 0 objects
> Object prefix: benchmark_data_torsk2_11207
>   sec Cur ops   started  finished  avg MB/s  cur MB/s  last lat(s)  avg lat(s)
>     0       0         0         0         0         0            -           0
>     1      16      5857      5841   22.8155   22.8164   0.00238437  0.00273434
>     2      15     11768     11753   22.9533   23.0938    0.0028559  0.00271944
>     3      16     17264     17248   22.4564   21.4648   0.00246666  0.00278101
>     4      16     22857     22841   22.3037   21.8477     0.002716  0.00280023
>     5      16     28462     28446   22.2213   21.8945   0.00220186    0.002811
>     6      16     34216     34200   22.2635   22.4766   0.00234315  0.00280552
>     7      16     39616     39600   22.0962   21.0938   0.00290661  0.00282718
>     8      16     45510     45494   22.2118   23.0234    0.0033541  0.00281253
>     9      16     50995     50979   22.1243   21.4258   0.00267282  0.00282371
>    10      16     56745     56729   22.1577   22.4609   0.00252583   0.0028193
> Total time run:         10.002668
> Total writes made:      56745
> Write size:             4096
> Object size:            4096
> Bandwidth (MB/sec):     22.1601
> Stddev Bandwidth:       0.712297
> Max bandwidth (MB/sec): 23.0938
> Min bandwidth (MB/sec): 21.0938
> Average IOPS:           5672
> Stddev IOPS:            182
> Max IOPS:               5912
> Min IOPS:               5400
> Average Latency(s):     0.00281953
> Stddev Latency(s):      0.00190771
> Max latency(s):         0.0834767
> Min latency(s):         0.00120945
>
> Min latency is fine -- but a max latency of 83ms?

Outliers during setup are to be expected and ignored.

> Average IOPS @ 5672?

Plenty of good reasons to come up with that number, yes.
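As a concrete illustration of the fio approach above -- a minimal
sketch against a kernel-mapped RBD device, not a definitive recipe;
the pool/image names, device path, and job parameters are all
assumptions to adjust for your setup:

  # create and map a test image (names are hypothetical):
  #   rbd create scbench/fio-test --size 10G
  #   rbd map scbench/fio-test
  # then run 4k random writes against the mapped device:
  fio --name=rand4k --filename=/dev/rbd0 --ioengine=libaio \
      --direct=1 --rw=randwrite --bs=4k --iodepth=32 --numjobs=4 \
      --runtime=60 --time_based --group_reporting

Look at the completion latency percentiles in fio's output, not just
the IOPS average; that is where outliers like your 83ms show up.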
> $ sudo rados bench -p scbench 10 rand
> hints = 1
>   sec Cur ops   started  finished  avg MB/s  cur MB/s  last lat(s)  avg lat(s)
>     0       0         0         0         0         0            -           0
>     1      15     23329     23314   91.0537   91.0703  0.000349856 0.000679074
>     2      16     48555     48539   94.7884   98.5352  0.000499159 0.000652067
>     3      16     76193     76177   99.1747   107.961  0.000443877 0.000622775
>     4      15    103923    103908   101.459   108.324  0.000678589 0.000609182
>     5      15    132720    132705   103.663   112.488  0.000741734 0.000595998
>     6      15    161811    161796   105.323   113.637  0.000333166 0.000586323
>     7      15    190196    190181   106.115   110.879  0.000612227 0.000582014
>     8      15    221155    221140   107.966   120.934  0.000471219 0.000571944
>     9      16    251143    251127   108.984   117.137  0.000267528 0.000566659
> Total time run:       10.000640
> Total reads made:     282097
> Read size:            4096
> Object size:          4096
> Bandwidth (MB/sec):   110.187
> Average IOPS:         28207
> Stddev IOPS:          2357
> Max IOPS:             30959
> Min IOPS:             23314
> Average Latency(s):   0.000560402
> Max latency(s):       0.109804
> Min latency(s):       0.000212671
>
> This is also quite far from what I expected. I have 12GB of memory for
> the OSD daemon on each host for caching -- on a close-to-idle cluster
> that is 50GB+ of cache for a working set of < 6GB -- so this should, in
> this case, not really be bound by the underlying SSDs.

Did you adjust the bluestore parameters (whatever they are this week or
for your version) to actually use that memory? See the P.S. below for
the knob I mean.

> But if it were:
>
> IOPS/disk * num disks / replication => 95K * 6 / 3 => 190K, or 6x off?
>
> There is no measurable service time in iostat when running the tests,
> so I have come to the conclusion that it has to be either the client
> side, the network path, or the OSD daemon that delivers the increasing
> latency / decreased IOPS.

Don't use iostat, use atop. Small IOPS are extremely CPU intensive, so
atop will give you an insight as to what might be busy besides the
actual storage device.

Christian

> Are there any suggestions on how to get more insight into that?
>
> Has anyone come close to replicating the numbers Micron are reporting
> on NVMe?
>
> Thanks a lot.
>
> [0]
> https://www.micron.com/-/media/client/global/documents/products/other-documents/micron_9200_max_ceph_12,-d-,2,-d-,8_luminous_bluestore_reference_architecture.pdf?la=en

-- 
Christian Balzer        Network/Systems Engineer
chibi@xxxxxxx           Rakuten Communications
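P.S.: On the cache question: for Luminous the first knob I would look
at is bluestore_cache_size_ssd (bluestore_cache_size, if set, overrides
it for all device classes). A minimal sketch with an illustrative value
rather than a recommendation -- verify the parameter names and defaults
against your exact minor version:

  [osd]
  # Luminous defaults to about 3 GiB of BlueStore cache per SSD-backed
  # OSD; with 12GB per host and one OSD each there is headroom to raise
  # it. The value is in bytes (8 GiB shown here).
  bluestore_cache_size_ssd = 8589934592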
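P.P.S.: On the network point: a crude floor for your per-IO latency is
the round-trip time between client and OSD hosts. The host name here is
a placeholder:

  # 1000 small pings, 10ms apart (root needed for sub-200ms intervals);
  # compare the average RTT against the ~0.56ms average read latency
  # rados bench reported above:
  sudo ping -c 1000 -i 0.01 -q osd-host

And remember that with 3x replication a write additionally waits for
the primary to reach both replica OSDs, so network RTT shows up more
than once in the write path.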