Hello,

> I'm not a "storage-guy" so please excuse me if I'm missing /
> overlooking something obvious.
>
> My question is in the area "what kind of performance am I to expect
> with this setup". We have bought servers, disks and networking for our
> future ceph-cluster and are now in our "testing-phase" and I simply
> want to understand if our numbers line up, or if we are missing
> something obvious.
>
A myriad of variables will make for a myriad of results, expected and
otherwise.

For example, you say nothing about the Ceph version, how the OSDs are
created (filestore, bluestore, details), OS and kernel (PTI!!) version.

> Background,
> - cephmon1, DELL R730, 1 x E5-2643, 64 GB
> - cephosd1-6, DELL R730, 1 x E5-2697, 64 GB
Unless you're planning on having 16 SSDs per node, a CPU with less and
faster cores would be better (see archives).

In general, you will want to run atop or something similar on your ceph
and client nodes during these tests to see where and if any resources
(CPU, DISK, NET) are getting stressed.

> - each server is connected to a dedicated 50 Gbe network, with
>   Mellanox-4 Lx cards (teamed into one interface, team0).
>
> In our test we only have one monitor. This will of course not be the
> case later on.
>
> Each OSD, has the following SSD's configured as pass-through (not raid
> 0 through the raid-controller),
>
> - 2 x Dell 1.6TB 2.5" SATA MLC MU 6Gbs SSD (THNSF81D60CSE), only spec I
>   can find on Dell's homepage says "Data Transfer Rate 600 Mbps"
> - 4 x Intel SSD DC S3700
>   (https://ark.intel.com/products/71916/Intel-SSD-DC-S3700-Series-800GB-2_5in-SATA-6Gbs-25nm-MLC)
When and where did you get those?
I wonder if they're available again, had 0 luck getting any last year.

> - 3 HDD's, which is uninteresting here. At the moment I'm only
>   interested in the performance of the SSD-pool.
>
> Ceph-cluster is created with ceph-ansible with "default params" (ie.
> have not added / changed anything except the necessary).
>
> When ceph-cluster is up, we have 54 OSD's (36 SSD, 18HDD).
> The min_size is 3 on the pool.
Any reason for that?
It will make any OSD failure result in a cluster lockup with a size of 3.
Unless you did set your size to 4, in which case you wrecked performance.

> Rules are created as follows,
>
> $ > ceph osd crush rule create-replicated ssd-rule default host ssd
> $ > ceph osd crush rule create-replicated hdd-rule default host hdd
>
> Testing is done on a separate node (same nic and network though),
>
> $ > ceph osd pool create ssd-bench 512 512 replicated ssd-rule
>
> $ > ceph osd pool application enable ssd-bench rbd
>
> $ > rbd create ssd-image --size 1T --pool ssd-pool
>
> $ > rbd map ssd-image --pool ssd-bench
>
> $ > mkfs.xfs /dev/rbd/ssd-bench/ssd-image
>
> $ > mount /dev/rbd/ssd-bench/ssd-image /ssd-bench
>
Unless you're planning on using the Ceph cluster in this fashion (kernel
mounted images), you'd be better off testing in an environment that
matches the use case, i.e. from a VM.

> Fio is then run like this,
> $ >
> actions="read randread write randwrite"
> blocksizes="4k 128k 8m"
> tmp_dir="/tmp/"
>
> for blocksize in ${blocksizes}; do
>   for action in ${actions}; do
>     rm -f ${tmp_dir}${action}_${blocksize}_${suffix}
>     fio --directory=/ssd-bench \
>         --time_based \
>         --direct=1 \
>         --rw=${action} \
>         --bs=$blocksize \
>         --size=1G \
>         --numjobs=100 \
>         --runtime=120 \
>         --group_reporting \
>         --name=testfile \
>         --output=${tmp_dir}${action}_${blocksize}_${suffix}
>   done
> done
>
> After running this, we end up with these numbers
>
> read_4k      iops : 159266 throughput : 622 MB / sec
> randread_4k  iops : 151887 throughput : 593 MB / sec
>
These are very nice numbers.
Too nice, in my book.
I have a test cluster with a cache-tier based on 2 nodes with 3 DC S3610s
400GB each, obviously with size 2 and min_size=1. So just based on that,
it will be faster than a size 3 pool, Jewel with Filestore.
Network is IPoIB (40Gb), so in that aspect similar to yours, 64k MTU
though.
Ceph nodes have E5-2620 v3 @ 2.40GHz CPUs and 32GB RAM.

I've run the following fio (with different rw actions of course) from a
KVM/qemu VM and am also showing how busy the SSDs, OSD processes, qemu
process on the comp node and the fio inside the VM are:
"fio --size=4G --ioengine=libaio --invalidate=1 --direct=1 --numjobs=1
--rw=read --name=fiojob --blocksize=4K --iodepth=64"

READ
  read : io=4096.0MB, bw=81361KB/s, iops=20340, runt= 51552msec
SSDs: 0% (pagecache on the nodes), OSDs: 45%, qemu: 330%, fio_in_VM: 19%

RANDREAD
  read : io=4096.0MB, bw=62760KB/s, iops=15689, runt= 66831msec
SSDs: 0% (pagecache on the nodes), OSDs: 50%, qemu: 550%!!, fio_in_VM: 23%

WRITE
  write: io=4096.0MB, bw=256972KB/s, iops=64243, runt= 16322msec
SSDs: 40%, OSDs: 20%, qemu: 150%, fio_in_VM: 45%

RANDWRITE
  write: io=4096.0MB, bw=43981KB/s, iops=10995, runt= 95366msec
SSDs: 38%, OSDs: 250%!!, qemu: 480%, fio_in_VM: 23%

Note especially the OSD CPU usage in the randwrite fio, this is where
faster (and non-powersaving mode) CPUs will be significant.
I'm not seeing the same level of performance reductions with rand actions
in your results.

We can roughly compare the reads as the SSDs and pool size play little to
no part in it.
20k *6 (to compensate for your OSD numbers) is 120k, definitely the same
ball park as your 158k.
It doesn't explain the 282k with your old setup, unless the MTU is really
so significant (see below) or other things changed, like more

For nonrand writes you're basically looking at latency (numjobs is
meaningless), so that's why my 62k (remember size 2) are comparable to
your 50k or 80k respectively.
For randwrite the larger amount of OSDs in your case nicely explains the
difference seen.
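
As a rough sanity check of the read comparison above, the arithmetic can
be written out as a small Python sketch. The figures are the ones quoted
in this thread; the helper names and the idea that IOPS scale linearly by
a factor of 6 are illustrative assumptions only, real clusters will not
scale that cleanly.

```python
# Back-of-envelope check of the 4k read comparison.
# All figures are copied from this thread. The scaling factor of 6 is the
# rough assumption used in the text, not a measured property of Ceph.

def throughput_mib_s(iops: float, block_size_bytes: int) -> float:
    """Convert an IOPS figure to MiB/s for a fixed block size."""
    return iops * block_size_bytes / (1024 * 1024)

def scaled_iops(measured_iops: float, factor: float) -> float:
    """Naively scale a measured IOPS figure by a constant factor."""
    return measured_iops * factor

my_read_4k = 20340      # sequential 4k read, VM test on the small cluster
new_read_4k = 159266    # read_4k on the new 36-SSD cluster
poc_read_4k = 282303    # read 4k from the earlier POC

estimate = scaled_iops(my_read_4k, 6)
print(f"scaled estimate : {estimate:.0f} IOPS")
print(f"new cluster     : {new_read_4k} IOPS "
      f"({new_read_4k / estimate:.2f}x the estimate)")
print(f"POC             : {poc_read_4k} IOPS "
      f"({poc_read_4k / estimate:.2f}x the estimate)")

# Cross-check that a reported IOPS figure matches its reported throughput.
print(f"read_4k as MiB/s: {throughput_mib_s(new_read_4k, 4096):.0f}")
```

The same conversion can be applied to any other row of the result tables
to spot figures that do not line up with their block size.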
> read_128k      iops : 31705 throughput : 3963.3 MB / sec
> randread_128k  iops : 31664 throughput : 3958.5 MB / sec
>
> read_8m        iops : 470   throughput : 3765.5 MB / sec
> randread_8m    iops : 463   throughput : 3705.4 MB / sec
>
> write_4k       iops : 50486 throughput : 197 MB / sec
> randwrite_4k   iops : 42491 throughput : 165 MB / sec
>
> write_128k     iops : 15907 throughput : 1988.5 MB / sec
> randwrite_128k iops : 15558 throughput : 1944.9 MB / sec
>
> write_8m       iops : 347   throughput : 2781.2 MB / sec
> randwrite_8m   iops : 347   throughput : 2777.2 MB / sec
>
>
> Ok, if you read all way here, the million dollar question is of course
> if the numbers above are in the ballpark of what to expect, or if they
> are low.
>
> The main reason I'm a bit uncertain on the numbers above are, and this
> may sound fuzzy but, because we did POC a couple of months ago with (if
> I remember the configuration correctly, unfortunately we only saved the
> numbers, not the *exact* configuration *sigh* (networking still the
> same though)) with fewer OSD's and those numbers were
>
Which unfortunately basically means that these results are...
questionable when comparing them with your current setup.

> read 4k        iops : 282303 throughput : 1102.8 MB / sec (b)
> randread 4k    iops : 253453 throughput : 990.52 MB / sec (b)
>
> read 128k      iops : 31298  throughput : 3912 MB / sec (w)
> randread 128k  iops : 9013   throughput : 1126.8 MB / sec (w)
>
> read 8m        iops : 405    throughput : 3241.4 MB / sec (w)
> randread 8m    iops : 369    throughput : 2957.8 MB / sec (w)
>
> write 4k       iops : 80644  throughput : 315 MB / sec (b)
> randwrite 4k   iops : 53178  throughput : 207 MB / sec (b)
>
> write 128k     iops : 17126  throughput : 2140.8 MB / sec (b)
> randwrite 128k iops : 11654  throughput : 2015.9 MB / sec (b)
>
> write 8m       iops : 258    throughput : 2067.1 MB / sec (w)
> randwrite 8m   iops : 251    throughput : 1456.9 MB / sec (w)
>
> Where (b) is higher number and (w) is lower.
> What I would expect since
> adding more OSD's was an increase on *all* numbers. The read_4k_
> throughput and iops number in current setup is not even close to the
> POC which makes me wonder if these "new" numbers are what they "are
> suppose to" or if I'm missing something obvious.
>
> Ehm, in this new setup we are running with MTU 1500, I think we had the
> POC to 9000, but the difference on the read_4k is roughly 400 MB/sec
> and I wonder if the MTU will make up for that.
>
You're in the best position of everybody here to verify this by changing
your test cluster to use the other MTU and compare...

> Is the above a good way of measuring our cluster, or is it better more
> reliable ways of measuring it ?
>
See above.
A fio test is definitely a closer thing to reality compared to OSD or
RADOS benches.

> Is there a way to calculate this "theoretically" (ie with with 6 nodes
> and 36 SSD's we should get these numbers) and then compare it to the
> reality. Again, not a storage guy and haven't really done this before
> so please excuse me for my laymen terms.
>
People have tried in the past and AFAIR nothing really conclusive came
about, it really is a game of too many variables.

Regards,

Christian
-- 
Christian Balzer        Network/Systems Engineer
chibi@xxxxxxx           Rakuten Communications
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com