Re: New Ceph-cluster and performance "questions"

Christian Balzer <chibi@xxxxxxx> · Tue, 6 Feb 2018 10:47:37 +0900

Hello,

> I'm not a "storage-guy" so please excuse me if I'm missing /
> overlooking something obvious. 
> 
> My question is in the area "what kind of performance am I to expect
> with this setup". We have bought servers, disks and networking for our
> future ceph-cluster and are now in our "testing-phase" and I simply
> want to understand if our numbers line up, or if we are missing
> something obvious. 
> 
A myriad of variables will make for a myriad of results, expected and
otherwise.

For example, you say nothing about the Ceph version, how the OSDs are
created (filestore, bluestore, details), OS and kernel (PTI!!) version.

> Background, 
> - cephmon1, DELL R730, 1 x E5-2643, 64 GB 
> - cephosd1-6, DELL R730, 1 x E5-2697, 64 GB
Unless you're planning on having 16 SSDs per node, a CPU with fewer and
faster cores would be better (see archives). 

In general, you will want to run atop or something similar on your ceph
and client nodes during these tests to see where and if any resources
(CPU, DISK, NET) are getting stressed.

> - each server is connected to a dedicated 50 Gbe network, with
> Mellanox-4 Lx cards (teamed into one interface, team0).  
> 
> In our test we only have one monitor. This will of course not be the
> case later on. 
> 
> Each OSD, has the following SSD's configured as pass-through (not raid
> 0 through the raid-controller),
> 
> - 2 x Dell 1.6TB 2.5" SATA MLC MU 6Gbs SSD (THNSF81D60CSE), only spec I
> can find on Dell's homepage says "Data Transfer Rate 600 Mbps"
> - 4 x Intel SSD DC S3700 (https://ark.intel.com/products/71916/Intel-SS
> D-DC-S3700-Series-800GB-2_5in-SATA-6Gbs-25nm-MLC)
When and where did you get those?
I wonder if they're available again, had 0 luck getting any last year.

> - 3 HDD's, which is uninteresting here. At the moment I'm only
> interested in the performance of the SSD-pool.
> 
> Ceph-cluster is created with ceph-ansible with "default params" (ie.
> have not added / changed anything except the necessary). 
> 
> When ceph-cluster is up, we have 54 OSD's (36 SSD, 18HDD). 
> The min_size is 3 on the pool. 
Any reason for that?
It will make any OSD failure result in a cluster lockup with a size of 3.
Unless you did set your size to 4, in which case you wrecked performance.
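
To spell out the arithmetic behind that, here is a simplified sketch. It is
not how Ceph itself evaluates PG state, it ignores recovery onto other OSDs,
and the function name is made up for illustration. A PG only accepts I/O
while the number of surviving replicas is at least min_size. The real values
can be read with "ceph osd pool get ssd-bench size" and
"ceph osd pool get ssd-bench min_size".

```shell
# Hypothetical helper: does a replicated PG still accept I/O after
# losing some of its replicas?
# Args: size (replica count), min_size, failed (replicas currently down)
pool_io_state() {
    size=$1; min_size=$2; failed=$3
    remaining=$((size - failed))
    if [ "$remaining" -ge "$min_size" ]; then
        echo "active"
    else
        echo "blocked"
    fi
}

pool_io_state 3 3 1   # size 3, min_size 3, one OSD down -> blocked
pool_io_state 3 2 1   # size 3, min_size 2, one OSD down -> active
```

With size 3 and min_size 2 a single OSD failure leaves the affected PGs
writable while the third replica is being restored.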

> Rules are created as follows, 
> 
> $ > ceph osd crush rule create-replicated ssd-rule default host ssd
> $ > ceph osd crush rule create-replicated hdd-rule default host hdd
> 
> Testing is done on a separate node (same nic and network though), 
> 
> $ > ceph osd pool create ssd-bench 512 512 replicated ssd-rule
> 
> $ > ceph osd pool application enable ssd-bench rbd
> 
> $ > rbd create ssd-image --size 1T --pool ssd-pool
> 
> $ > rbd map ssd-image --pool ssd-bench
> 
> $ > mkfs.xfs /dev/rbd/ssd-bench/ssd-image
> 
> $ > mount /dev/rbd/ssd-bench/ssd-image /ssd-bench
> 
Unless you're planning on using the Ceph cluster in this fashion (kernel
mounted images), you'd be better off testing in an environment that
matches the use case, i.e. from a VM.

> Fio is then run like this, 
> $ > 
> actions="read randread write randwrite"
> blocksizes="4k 128k 8m"
> tmp_dir="/tmp/"
> 
> for blocksize in ${blocksizes}; do
>   for action in ${actions}; do
>     rm -f ${tmp_dir}${action}_${blocksize}_${suffix}
>     fio --directory=/ssd-bench \
>         --time_based \ 
>         --direct=1 \
>         --rw=${action} \
>         --bs=$blocksize \
>         --size=1G \
>         --numjobs=100 \
>         --runtime=120 \
>         --group_reporting \
>         --name=testfile \
>         --output=${tmp_dir}${action}_${blocksize}_${suffix}
>   done
> done
> 
> After running this, we end up with these numbers 
> 
> read_4k         iops : 159266     throughput : 622    MB / sec
> randread_4k     iops : 151887     throughput : 593    MB / sec
> 
These are very nice numbers. 
Too nice, in my book.
I have a test cluster with a cache-tier based on 2 nodes with 3 DC S3610s
400GB each, obviously with size 2 and min_size=1. So just based on that,
it will be faster than a size 3 pool, Jewel with Filestore.
Network is IPoIB (40Gb), so in that aspect similar to yours, 
64k MTU though.
Ceph nodes have E5-2620 v3 @ 2.40GHz CPUs and 32GB RAM.
I've run the following fio (with different rw actions of course) from a
KVM/qemu VM and am also showing how busy the SSDs, OSD processes, qemu
process on the comp node and the fio inside the VM are:
"fio --size=4G --ioengine=libaio --invalidate=1 --direct=1 --numjobs=1
--rw=read --name=fiojob --blocksize=4K --iodepth=64"

READ
  read : io=4096.0MB, bw=81361KB/s, iops=20340, runt= 51552msec
SSDs: 0% (pagecache on the nodes), OSDs: 45%, qemu: 330%, fio_in_VM: 19%

RANDREAD
  read : io=4096.0MB, bw=62760KB/s, iops=15689, runt= 66831msec
SSDs: 0% (pagecache on the nodes), OSDs: 50%, qemu: 550%!!, fio_in_VM: 23%

WRITE
  write: io=4096.0MB, bw=256972KB/s, iops=64243, runt= 16322msec
SSDs: 40%, OSDs: 20%, qemu: 150%, fio_in_VM: 45%

RANDWRITE
  write: io=4096.0MB, bw=43981KB/s, iops=10995, runt= 95366msec
SSDs: 38%, OSDs: 250%!!, qemu: 480%, fio_in_VM: 23%
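
As a quick consistency check on the fio lines above: bandwidth should simply
be IOPS times block size. A minimal sketch, assuming the 4K block size from
my command line (the helper name is made up):

```shell
# Hypothetical helper: bandwidth in KB/s implied by IOPS and block size in KB.
implied_bw_kb() {
    echo $(( $1 * $2 ))
}

implied_bw_kb 20340 4   # READ:      81360,  fio reported bw=81361KB/s
implied_bw_kb 15689 4   # RANDREAD:  62756,  fio reported bw=62760KB/s
implied_bw_kb 64243 4   # WRITE:     256972, fio reported bw=256972KB/s
implied_bw_kb 10995 4   # RANDWRITE: 43980,  fio reported bw=43981KB/s
```

The same check is worth doing on your own result files before comparing runs.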

Note especially the OSD CPU usage in the randwrite fio, this is where
faster (and non-powersaving mode) CPUs will be significant. 
I'm not seeing the same level of performance reductions with rand actions
in your results.

We can roughly compare the reads as the SSDs and pool size play little to
no part in it. 
20k * 6 (to compensate for your OSD numbers) is 120k, definitely the same
ballpark as your 159k.
It doesn't explain the 282k with your old setup, unless the MTU is really
so significant (see below) or other things changed, like more 

For nonrand writes you're basically looking at latency (numjobs is
meaningless), so that's why my 62k (remember size 2) are comparable to your
50k or 80k respectively. 
For randwrite the larger amount of OSDs in your case nicely explains the
difference seen.
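
To put a number on "latency" for my own runs: Little's law says average
latency is the number of outstanding I/Os divided by IOPS. The sketch below
assumes the queue really stays full at the iodepth=64 from my fio command,
which is an assumption and not something I measured; the helper name is made
up.

```shell
# Hypothetical helper: average per-I/O latency in microseconds,
# from Little's law (latency = outstanding I/Os / IOPS).
avg_latency_us() {      # args: outstanding I/Os, IOPS
    echo $(( $1 * 1000000 / $2 ))
}

avg_latency_us 64 64243   # my WRITE run:     996 us
avg_latency_us 64 10995   # my RANDWRITE run: 5820 us
```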

> read_128k       iops : 31705      throughput : 3963.3 MB / sec
> randread_128k   iops : 31664      throughput : 3958.5 MB / sec
> 
> read_8m         iops : 470        throughput : 3765.5 MB / sec
> randread_8m     iops : 463        throughput : 3705.4 MB / sec
> 
> write_4k        iops : 50486      throughput : 197    MB / sec
> randwrite_4k    iops : 42491      throughput : 165    MB / sec
> 
> write_128k      iops : 15907      throughput : 1988.5 MB / sec
> randwrite_128k  iops : 15558      throughput : 1944.9 MB / sec
> 
> write_8m        iops : 347        throughput : 2781.2 MB / sec
> randwrite_8m    iops : 347        throughput : 2777.2 MB / sec
> 
> 
> Ok, if you read all way here, the million dollar question is of course
> if the numbers above are in the ballpark of what to expect, or if they
> are low. 
> 
> The main reason I'm a bit uncertain on the numbers above are, and this
> may sound fuzzy but, because we did POC a couple of months ago with (if
> I remember the configuration correctly, unfortunately we only saved the
> numbers, not the *exact* configuration *sigh* (networking still the
> same though)) with fewer OSD's and those numbers were
> 
Which unfortunately basically means that these results are... questionable
when comparing them with your current setup.

> read 4k          iops : 282303   throughput : 1102.8  MB / sec (b)
> randread 4k      iops : 253453   throughput : 990.52  MB / sec (b)
> 
> read 128k        iops : 31298    throughput : 3912    MB / sec (w)
> randread 128k    iops : 9013     throughput : 1126.8  MB / sec (w)
> 
> read 8m          iops : 405      throughput : 3241.4  MB / sec (w)
> randread 8m      iops : 369      throughput : 2957.8  MB / sec (w)
> 
> write 4k         iops : 80644    throughput : 315     MB / sec (b)
> randwrite 4k     iops : 53178    throughput : 207     MB / sec (b)
> 
> write 128k       iops : 17126    throughput : 2140.8  MB / sec (b)
> randwrite 128k   iops : 11654    throughput : 2015.9  MB / sec (b)
> 
> write 8m         iops : 258      throughput : 2067.1  MB / sec (w)
> randwrite 8m     iops : 251      throughput : 1456.9  MB / sec (w)
> 
> Where (b) is higher number and (w) is lower. What I would expect since
> adding more OSD's was an increase on *all* numbers. The read_4k_
> throughput and iops number in current setup is not even close to the
> POC which makes me wonder if these "new" numbers are what they "are
> suppose to" or if I'm missing something obvious. 
> 
> Ehm, in this new setup we are running with MTU 1500, I think we had the
> POC to 9000, but the difference on the read_4k is roughly 400 MB/sec
> and I wonder if the MTU will make up for that. 
> 
You're in the best position of everybody here to verify this by changing
your test cluster to use the other MTU and compare...
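
For a feel of why the MTU could matter with small I/Os, here is a crude
sketch of how many Ethernet frames one payload needs. It assumes 40 bytes of
IPv4 + TCP headers per frame with no options, and it ignores Ceph's own
message headers and the fact that TCP is a byte stream, so treat it as an
illustration only. The helper name is made up. Switching the MTU itself is
done with something like "ip link set dev team0 mtu 9000" on every node, and
the switch ports have to allow jumbo frames as well.

```shell
# Hypothetical helper: frames needed to carry one payload at a given MTU,
# assuming 40 bytes of IPv4 + TCP headers per frame (no options).
frames_per_io() {       # args: payload bytes, MTU
    payload=$1; mtu=$2
    per_frame=$((mtu - 40))
    echo $(( (payload + per_frame - 1) / per_frame ))   # integer ceiling
}

frames_per_io 4096 1500   # 3 frames per 4k I/O
frames_per_io 4096 9000   # 1 frame per 4k I/O
```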

> Is the above a good way of measuring our cluster, or is it better more
> reliable ways of measuring it ? 
> 
See above.
A fio test is definitely a closer thing to reality compared to OSD or
RADOS benches.

> Is there a way to calculate this "theoretically" (ie with with 6 nodes
> and 36 SSD's we should get these numbers) and then compare it to the
> reality. Again, not a storage guy and haven't really done this before
> so please excuse me for my laymen terms. 
> 
People have tried in the past and AFAIR nothing really conclusive came
about, it really is a game of too many variables. 

Regards,

Christian
-- 
Christian Balzer        Network/Systems Engineer                
chibi@xxxxxxx   	Rakuten Communications
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com