Hi Christian,

First of all, thanks for all the great answers and sorry for the late
reply.

On Tue, 2018-02-06 at 10:47 +0900, Christian Balzer wrote:
> Hello,
>
> > I'm not a "storage-guy" so please excuse me if I'm missing /
> > overlooking something obvious.
> >
> > My question is in the area "what kind of performance am I to expect
> > with this setup". We have bought servers, disks and networking for
> > our future ceph-cluster and are now in our "testing-phase" and I
> > simply want to understand if our numbers line up, or if we are
> > missing something obvious.
> >
>
> A myriad of variables will make for a myriad of results, expected and
> otherwise.
>
> For example, you say nothing about the Ceph version, how the OSDs are
> created (filestore, bluestore, details), OS and kernel (PTI!!)
> version.

Good catch, I totally forgot this.

$ > ceph version 12.2.1-40.el7cp (c6d85fd953226c9e8168c9abe81f499d66cc2716) luminous (stable),
deployed via Red Hat Ceph Storage 3 (ceph-ansible). Bluestore is
enabled, and osd_scenario is set to collocated.

$ > cat /etc/redhat-release
Red Hat Enterprise Linux Server release 7.4 (Maipo)

$ > uname -r
3.10.0-693.11.6.el7.x86_64 (PTI *not* disabled at boot)

> > Background,
> > - cephmon1, DELL R730, 1 x E5-2643, 64 GB
> > - cephosd1-6, DELL R730, 1 x E5-2697, 64 GB
>
> Unless you're planning on having 16 SSDs per node, a CPU with less
> and faster cores would be better (see archives).
>
> In general, you will want to run atop or something similar on your
> ceph and client nodes during these tests to see where and if any
> resources (CPU, DISK, NET) are getting stressed.

Understood, thanks!

> > - each server is connected to a dedicated 50 Gbe network, with
> >   Mellanox-4 Lx cards (teamed into one interface, team0).
> >
> > In our test we only have one monitor. This will of course not be
> > the case later on.
> >
> > Each OSD, has the following SSD's configured as pass-through (not
> > raid 0 through the raid-controller),
> >
> > - 2 x Dell 1.6TB 2.5" SATA MLC MU 6Gbs SSD (THNSF81D60CSE), only
> >   spec I can find on Dell's homepage says "Data Transfer Rate 600
> >   Mbps"
> > - 4 x Intel SSD DC S3700
> >   (https://ark.intel.com/products/71916/Intel-SSD-DC-S3700-Series-800GB-2_5in-SATA-6Gbs-25nm-MLC)
>
> When and where did you get those?
> I wonder if they're available again, had 0 luck getting any last
> year.

Those are actually disks that we have had "lying around"; no clue where
you could get them today.

> > - 3 HDD's, which is uninteresting here. At the moment I'm only
> >   interested in the performance of the SSD-pool.
> >
> > Ceph-cluster is created with ceph-ansible with "default params"
> > (ie. have not added / changed anything except the necessary).
> >
> > When ceph-cluster is up, we have 54 OSD's (36 SSD, 18HDD).
> > The min_size is 3 on the pool.
>
> Any reason for that?
> It will make any OSD failure result in a cluster lockup with a size
> of 3.
> Unless you did set your size to 4, in which case you wrecked
> performance.

Hm, sorry, what I meant was size=3. Reading the documentation, I'm not
sure I understand the difference between size and min_size.
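On the size vs. min_size question: size is the number of replicas the pool keeps of each object, while min_size is the minimum number of replicas that must be available for a placement group to keep accepting I/O. That is why size=3 with min_size=3 blocks I/O on the affected PGs as soon as a single OSD fails. A minimal sketch of just that counting rule follows; the helper name pg_serves_io is made up for illustration, and it does not model Ceph's actual peering or recovery logic.

```shell
# Hypothetical helper: models only the replica-counting rule, not Ceph itself.
# usage: pg_serves_io SIZE MIN_SIZE FAILED_REPLICAS
pg_serves_io() {
    up=$(( $1 - $3 ))          # replicas still available
    if [ "$up" -ge "$2" ]; then
        echo yes               # enough replicas left: the PG keeps serving I/O
    else
        echo no                # below min_size: I/O blocks until recovery
    fi
}

pg_serves_io 3 3 1   # size=3, min_size=3, one OSD down -> prints "no"
pg_serves_io 3 2 1   # size=3, min_size=2, one OSD down -> prints "yes"
```

On a live cluster the current values can be read with "ceph osd pool get ssd-bench size" and "ceph osd pool get ssd-bench min_size", and min_size can be changed with "ceph osd pool set ssd-bench min_size 2" (pool name taken from the test setup in this thread).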
> > Rules are created as follows,
> >
> > $ > ceph osd crush rule create-replicated ssd-rule default host ssd
> > $ > ceph osd crush rule create-replicated hdd-rule default host hdd
> >
> > Testing is done on a separate node (same nic and network though),
> >
> > $ > ceph osd pool create ssd-bench 512 512 replicated ssd-rule
> >
> > $ > ceph osd pool application enable ssd-bench rbd
> >
> > $ > rbd create ssd-image --size 1T --pool ssd-bench
> >
> > $ > rbd map ssd-image --pool ssd-bench
> >
> > $ > mkfs.xfs /dev/rbd/ssd-bench/ssd-image
> >
> > $ > mount /dev/rbd/ssd-bench/ssd-image /ssd-bench
> >
>
> Unless you're planning on using the Ceph cluster in this fashion
> (kernel mounted images), you'd be better off testing in an
> environment that matches the use case, i.e. from a VM.

Gotcha, thanks!

> > Fio is then run like this,
> >
> > actions="read randread write randwrite"
> > blocksizes="4k 128k 8m"
> > tmp_dir="/tmp/"
> >
> > for blocksize in ${blocksizes}; do
> >   for action in ${actions}; do
> >     rm -f ${tmp_dir}${action}_${blocksize}_${suffix}
> >     fio --directory=/ssd-bench \
> >         --time_based \
> >         --direct=1 \
> >         --rw=${action} \
> >         --bs=$blocksize \
> >         --size=1G \
> >         --numjobs=100 \
> >         --runtime=120 \
> >         --group_reporting \
> >         --name=testfile \
> >         --output=${tmp_dir}${action}_${blocksize}_${suffix}
> >   done
> > done
> >
> > After running this, we end up with these numbers
> >
> > read_4k          iops : 159266  throughput : 622 MB / sec
> > randread_4k      iops : 151887  throughput : 593 MB / sec
> >
>
> These are very nice numbers.
> Too nice, in my book.
> I have a test cluster with a cache-tier based on 2 nodes with 3 DC
> S3610s 400GB each, obviously with size 2 and min_size=1. So just
> based on that, it will be faster than a size 3 pool, Jewel with
> Filestore.
> Network is IPoIB (40Gb), so in that aspect similar to yours,
> 64k MTU though.
> Ceph nodes have E5-2620 v3 @ 2.40GHz CPUs and 32GB RAM.
> I've run the following fio (with different rw actions of course) from
> a KVM/qemu VM and am also showing how busy the SSDs, OSD processes,
> qemu process on the comp node and the fio inside the VM are:
> "fio --size=4G --ioengine=libaio --invalidate=1 --direct=1 --numjobs=1
> --rw=read --name=fiojob --blocksize=4K --iodepth=64"
>
> READ
>   read : io=4096.0MB, bw=81361KB/s, iops=20340, runt= 51552msec
> SSDs: 0% (pagecache on the nodes), OSDs: 45%, qemu: 330%, fio_in_VM: 19%
>
> RANDREAD
>   read : io=4096.0MB, bw=62760KB/s, iops=15689, runt= 66831msec
> SSDs: 0% (pagecache on the nodes), OSDs: 50%, qemu: 550%!!, fio_in_VM: 23%
>
> WRITE
>   write: io=4096.0MB, bw=256972KB/s, iops=64243, runt= 16322msec
> SSDs: 40%, OSDs: 20%, qemu: 150%, fio_in_VM: 45%
>
> RANDWRITE
>   write: io=4096.0MB, bw=43981KB/s, iops=10995, runt= 95366msec
> SSDs: 38%, OSDs: 250%!!, qemu: 480%, fio_in_VM: 23%
>
> Note especially the OSD CPU usage in the randwrite fio, this is where
> faster (and non-powersaving mode) CPUs will be significant.
> I'm not seeing the same level of performance reductions with rand
> actions in your results.
>
> We can roughly compare the reads as the SSDs and pool size play
> little to no part in it.
> 20k *6 (to compensate for your OSD numbers) is 120k, definitely the
> same ball park as your 158k.
> It doesn't explain the 282k with your old setup, unless the MTU is
> really so significant (see below) or other things changed, like more

Thanks for all that, makes sense. I'm not sure I will dig much deeper
into why I got those numbers to begin with. It is a bit annoying, but
since we have little knowledge about the disks and that previous setup,
it's impossible to compare (since, as you say in the beginning, a
"myriad of variables will make for a myriad of results").

> For nonrand writes your basically looking at latency (numjobs is
> meaningless), so thats why my 62k (remember size 2) are comparable to
> your 50k or 80k respectively.
> For randwrite the larger amount of OSDs in your case nicely explains
> the difference seen.
>
> > read_128k        iops : 31705   throughput : 3963.3 MB / sec
> > randread_128k    iops : 31664   throughput : 3958.5 MB / sec
> >
> > read_8m          iops : 470     throughput : 3765.5 MB / sec
> > randread_8m      iops : 463     throughput : 3705.4 MB / sec
> >
> > write_4k         iops : 50486   throughput : 197 MB / sec
> > randwrite_4k     iops : 42491   throughput : 165 MB / sec
> >
> > write_128k       iops : 15907   throughput : 1988.5 MB / sec
> > randwrite_128k   iops : 15558   throughput : 1944.9 MB / sec
> >
> > write_8m         iops : 347     throughput : 2781.2 MB / sec
> > randwrite_8m     iops : 347     throughput : 2777.2 MB / sec
> >
> >
> > Ok, if you read all way here, the million dollar question is of
> > course if the numbers above are in the ballpark of what to expect,
> > or if they are low.
> >
> > The main reason I'm a bit uncertain on the numbers above are, and
> > this may sound fuzzy but, because we did POC a couple of months ago
> > with (if I remember the configuration correctly, unfortunately we
> > only saved the numbers, not the *exact* configuration *sigh*
> > (networking still the same though)) with fewer OSD's and those
> > numbers were
> >

Thanks again.

> Which unfortunately basically means that these results are...
> questionable when comparing them with your current setup.
> >
> > read 4k          iops : 282303  throughput : 1102.8 MB / sec (b)
> > randread 4k      iops : 253453  throughput : 990.52 MB / sec (b)
> >
> > read 128k        iops : 31298   throughput : 3912 MB / sec (w)
> > randread 128k    iops : 9013    throughput : 1126.8 MB / sec (w)
> >
> > read 8m          iops : 405     throughput : 3241.4 MB / sec (w)
> > randread 8m      iops : 369     throughput : 2957.8 MB / sec (w)
> >
> > write 4k         iops : 80644   throughput : 315 MB / sec (b)
> > randwrite 4k     iops : 53178   throughput : 207 MB / sec (b)
> >
> > write 128k       iops : 17126   throughput : 2140.8 MB / sec (b)
> > randwrite 128k   iops : 11654   throughput : 2015.9 MB / sec (b)
> >
> > write 8m         iops : 258     throughput : 2067.1 MB / sec (w)
> > randwrite 8m     iops : 251     throughput : 1456.9 MB / sec (w)
> >
> > Where (b) is higher number and (w) is lower. What I would expect
> > since adding more OSD's was an increase on *all* numbers. The
> > read_4k_ throughput and iops number in current setup is not even
> > close to the POC which makes me wonder if these "new" numbers are
> > what they "are suppose to" or if I'm missing something obvious.
> >
> > Ehm, in this new setup we are running with MTU 1500, I think we had
> > the POC to 9000, but the difference on the read_4k is roughly 400
> > MB/sec and I wonder if the MTU will make up for that.
> >
>
> You're in the best position of everybody here to verify this by
> changing your test cluster to use the other MTU and compare...

Yes, we will do some more benchmarks and monitor the results.

> > Is the above a good way of measuring our cluster, or is it better
> > more reliable ways of measuring it ?
> >
>
> See above.
> A fio test is definitely a closer thing to reality compared to OSD or
> RADOS benches.
>
> > Is there a way to calculate this "theoretically" (ie with with 6
> > nodes and 36 SSD's we should get these numbers) and then compare it
> > to the reality.
> > Again, not a storage guy and haven't really done this before
> > so please excuse me for my laymen terms.
> >
>
> People have tried in the past and AFAIR nothing really conclusive
> came about, it really is a game of too many variables.
>
> Regards,
>
> Christian

Again, thanks for everything, nicely explained.

Not sure if it could be of interest for anyone, but I took a screenshot
of our fio diagrams generated in Confluence and put it here,
https://imgur.com/a/PaMLg

Basically the only interesting bars are the yellow (test #3) and the
purple (test #4), as those are the ones where I actually know the exact
configuration. It is interesting to see that enabling the raid
controller and putting all disks in raid 0 (disk cache disabled)
instead of pass-through yielded quite a big improvement in the
write128k/write8m and randwrite128k/randwrite8m areas, whereas in the
other areas there isn't much of a difference. So in my opinion, raid 0
would be the way to go - however, I see some different opinions about
this. Red Hat talks about this in the following pdf,
https://www.redhat.com/cms/managed-files/st-rhcs-config-guide-technology-detail-inc0387897-201604-en.pdf

Any thoughts about this?

Putting disks through the raid-controller also messes up the automatic
type-classification that ceph does, which is annoying - but
"workaroundable". As I understand it, ceph determines the disk class by
looking at the value in /sys/block/<disk>/queue/rotational (1 is hdd, 0
is ssd). This value gets set correctly when using "pass through
(non-raid)", but when using raid 0 it gets set to 1 even though the
disks are SSDs. We work around this with the following udev rule, where
"sd[a-c]" would be the ssd-disks.

$ > echo 'ACTION=="add|change", KERNEL=="sd[a-c]", ATTR{queue/rotational}="0"' >> /etc/udev/rules.d/10-ssd-persistent.rules

Maybe this has been mentioned, but I'm curious about why this happens;
does anyone know?

Again, thanks for all the great work.
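To double-check what the kernel reports for each disk before and after a udev rule like the one above is applied, the rotational flag can be read straight from sysfs. This is only a sketch: the helper name disk_class is made up for illustration, the sd* glob is an assumption about device naming on the OSD nodes, and the 0/1 mapping is the one described above.

```shell
# Hypothetical helper: map the content of a queue/rotational file to a class name.
# usage: disk_class /sys/block/<disk>/queue/rotational
disk_class() {
    flag=$(cat "$1" 2>/dev/null) || { echo unknown; return 0; }
    case "$flag" in
        0) echo ssd ;;       # non-rotational
        1) echo hdd ;;       # rotational
        *) echo unknown ;;   # unexpected content
    esac
}

# Print one "<device>: <class>" line per sd* block device found.
for f in /sys/block/sd*/queue/rotational; do
    [ -e "$f" ] || continue            # glob did not match: no sd* devices
    dev=${f#/sys/block/}               # strip the leading path ...
    dev=${dev%%/*}                     # ... and everything after the device name
    printf '%s: %s\n' "$dev" "$(disk_class "$f")"
done
```

A new rule normally only takes effect after "udevadm control --reload-rules" followed by "udevadm trigger" (or a reboot). As an alternative to forcing the sysfs flag, Luminous also allows overriding the class per OSD with "ceph osd crush rm-device-class" and "ceph osd crush set-device-class"; whether that fits better here is a judgment call.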

Best regards,
Patrik
Sweden