New Ceph-cluster and performance "questions"

Hello, 

I'm not a "storage guy", so please excuse me if I'm missing or
overlooking something obvious.

My question boils down to "what kind of performance should I expect
with this setup?". We have bought servers, disks and networking for our
future Ceph cluster and are now in our testing phase, and I simply
want to understand whether our numbers line up or if we are missing
something obvious.

Background,
- cephmon1, DELL R730, 1 x E5-2643, 64 GB
- cephosd1-6, DELL R730, 1 x E5-2697, 64 GB
- each server is connected to a dedicated 50 GbE network with
Mellanox ConnectX-4 Lx cards (teamed into one interface, team0).

In our test we only have one monitor. This will of course not be the
case later on. 
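
For what it's worth, the raw-network sanity check I plan to repeat looks
roughly like this (assuming iperf3 is installed on the nodes; the address
below is just a placeholder for an OSD node's team0 IP):

# on one OSD node
$ > iperf3 -s

# on the test client, several parallel streams against that node
$ > iperf3 -c 10.0.0.11 -P 8 -t 30

With a teamed link a single TCP stream usually tops out around one member
link (depending on the teaming mode), so the parallel streams give a better
picture of the aggregate bandwidth.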

Each OSD node has the following SSDs configured as pass-through (not RAID
0 through the RAID controller),

- 2 x Dell 1.6 TB 2.5" SATA MLC MU 6 Gbps SSD (THNSF81D60CSE); the only
spec I can find on Dell's homepage says "Data Transfer Rate 600 Mbps"
- 4 x Intel SSD DC S3700 800 GB (https://ark.intel.com/products/71916/Intel-SSD-DC-S3700-Series-800GB-2_5in-SATA-6Gbs-25nm-MLC)
- 3 HDDs, which are uninteresting here; at the moment I'm only
interested in the performance of the SSD pool.
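
For a per-device baseline to compare the cluster numbers against, I was
thinking of a fio run directly against one of the raw SSDs, something like
the sketch below (/dev/sdX is a placeholder, and the run is destructive,
so only on an empty disk before it becomes an OSD):

$ > fio --name=raw-ssd-baseline \
        --filename=/dev/sdX \
        --direct=1 \
        --rw=randwrite \
        --bs=4k \
        --iodepth=32 \
        --numjobs=4 \
        --runtime=60 \
        --time_based \
        --group_reporting

Multiplying that per-disk number by 36 should at least give a rough upper
bound on what the SSD pool could ever deliver before replication and
network overhead.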

The Ceph cluster is created with ceph-ansible with "default params" (i.e.
we have not added or changed anything except the necessary).
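
To double-check what actually got applied versus the built-in defaults, I
assume something like this on one of the OSD nodes would do (the daemon id
depends on the host):

$ > ceph daemon osd.0 config diff        # settings that differ from the defaults
$ > ceph daemon osd.0 config show | less # the full effective configuration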

When the cluster is up, we have 54 OSDs (36 SSD, 18 HDD).
The min_size is 3 on the pool.

Rules are created as follows, 

$ > ceph osd crush rule create-replicated ssd-rule default host ssd
$ > ceph osd crush rule create-replicated hdd-rule default host hdd
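
And to verify that the rules and device classes ended up the way we think,
something along these lines:

$ > ceph osd crush rule dump ssd-rule
$ > ceph osd crush tree --show-shadow   # per-class shadow hierarchy
$ > ceph osd df tree                    # class column and utilisation per OSD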

Testing is done on a separate node (same NIC and network, though),

$ > ceph osd pool create ssd-bench 512 512 replicated ssd-rule

$ > ceph osd pool application enable ssd-bench rbd

$ > rbd create ssd-image --size 1T --pool ssd-bench

$ > rbd map ssd-image --pool ssd-bench

$ > mkfs.xfs /dev/rbd/ssd-bench/ssd-image

$ > mount /dev/rbd/ssd-bench/ssd-image /ssd-bench
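
As a cross-check that bypasses the filesystem layer, I was also thinking
of letting rbd benchmark the image directly, something like this (untested
sketch on this cluster):

$ > rbd bench --io-type write ssd-bench/ssd-image \
        --io-size 4K --io-threads 16 \
        --io-total 1G --io-pattern rand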

Fio is then run like this,

actions="read randread write randwrite"
blocksizes="4k 128k 8m"
tmp_dir="/tmp/"
suffix="ssd-bench"   # label for the output files (not shown in the original run)

for blocksize in ${blocksizes}; do
  for action in ${actions}; do
    rm -f ${tmp_dir}${action}_${blocksize}_${suffix}
    fio --directory=/ssd-bench \
        --time_based \
        --direct=1 \
        --rw=${action} \
        --bs=${blocksize} \
        --size=1G \
        --numjobs=100 \
        --runtime=120 \
        --group_reporting \
        --name=testfile \
        --output=${tmp_dir}${action}_${blocksize}_${suffix}
  done
done
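
If it helps with collecting results, fio can also emit JSON
(--output-format=json) and the interesting fields can be pulled out with
jq; a rough sketch, assuming jq is installed and using the suffix from the
loop above:

# re-run fio with --output-format=json added, then for example:
$ > jq '.jobs[0].read.iops, .jobs[0].read.bw, .jobs[0].write.iops, .jobs[0].write.bw' \
       /tmp/randread_4k_ssd-bench
# bw is reported in KiB/s in the JSON output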

After running this, we end up with these numbers 

read_4k         iops : 159266     throughput : 622    MB / sec
randread_4k     iops : 151887     throughput : 593    MB / sec

read_128k       iops : 31705      throughput : 3963.3 MB / sec
randread_128k   iops : 31664      throughput : 3958.5 MB / sec

read_8m         iops : 470        throughput : 3765.5 MB / sec
randread_8m     iops : 463        throughput : 3705.4 MB / sec

write_4k        iops : 50486      throughput : 197    MB / sec
randwrite_4k    iops : 42491      throughput : 165    MB / sec

write_128k      iops : 15907      throughput : 1988.5 MB / sec
randwrite_128k  iops : 15558      throughput : 1944.9 MB / sec

write_8m        iops : 347        throughput : 2781.2 MB / sec
randwrite_8m    iops : 347        throughput : 2777.2 MB / sec


Ok, if you read all the way here, the million dollar question is of course
whether the numbers above are in the ballpark of what to expect, or
whether they are low.

The main reason I'm a bit uncertain about the numbers above (and this may
sound fuzzy) is that we did a POC a couple of months ago with fewer OSDs.
Unfortunately we only saved the numbers, not the *exact* configuration
(*sigh*), but if I remember it correctly the networking was the same, and
those numbers were

read 4k          iops : 282303   throughput : 1102.8 MB / sec  (b)
randread 4k      iops : 253453   throughput : 990.52 MB / sec  (b)

read 128k        iops : 31298    throughput : 3912   MB / sec  (w)
randread 128k    iops : 9013     throughput : 1126.8 MB / sec  (w)

read 8m          iops : 405      throughput : 3241.4 MB / sec  (w)
randread 8m      iops : 369      throughput : 2957.8 MB / sec  (w)

write 4k         iops : 80644    throughput : 315    MB / sec  (b)
randwrite 4k     iops : 53178    throughput : 207    MB / sec  (b)

write 128k       iops : 17126    throughput : 2140.8 MB / sec  (b)
randwrite 128k   iops : 11654    throughput : 2015.9 MB / sec  (b)

write 8m         iops : 258      throughput : 2067.1 MB / sec  (w)
randwrite 8m     iops : 251      throughput : 1456.9 MB / sec  (w)

Here (b) marks POC numbers that are higher than in the current setup and
(w) ones that are lower. What I would have expected from adding more OSDs
is an increase in *all* numbers. The read_4k throughput and IOPS in the
current setup are not even close to the POC, which makes me wonder whether
these "new" numbers are what they are "supposed to be" or if I'm missing
something obvious.

Ehm, in this new setup we are running with MTU 1500; I think we had the
POC at MTU 9000. The difference on read_4k is roughly 400 MB/sec, though,
and I wonder whether the MTU alone would make up for that.
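
If we do bump the MTU again, I guess the check would look roughly like
this (interface name from our setup, the address is a placeholder):

$ > ip link set team0 mtu 9000        # on every node, plus the switch ports
$ > ping -M do -s 8972 10.0.0.11      # 8972 + 28 bytes of headers = 9000, must not fragment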

Is the above a good way of measuring our cluster, or are there better,
more reliable ways of measuring it?
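
One alternative I have seen mentioned is benchmarking the pool directly
with rados bench, which takes the filesystem and krbd out of the picture;
a sketch (untested here):

$ > rados bench -p ssd-bench 60 write -b 4M -t 16 --no-cleanup
$ > rados bench -p ssd-bench 60 seq -t 16
$ > rados bench -p ssd-bench 60 rand -t 16
$ > rados -p ssd-bench cleanup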

Is there a way to calculate this "theoretically" (i.e. with 6 nodes and
36 SSDs we should get roughly these numbers) and then compare that to
reality? Again, I'm not a storage guy and haven't really done this before,
so please excuse my layman's terms.
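
My own back-of-envelope attempt looks something like the sketch below;
the per-SSD figures are assumptions pulled from spec sheets, not
measurements:

ssds=36
per_ssd_read=500      # MB/sec, rough sequential read per SATA SSD (assumption)
per_ssd_write=400     # MB/sec, rough sequential write per SATA SSD (assumption)
replicas=3
client_net=$(( 50 * 1000 / 8 ))   # 50 Gbit/sec teamed link ~= 6250 MB/sec

echo "raw read ceiling   : $(( ssds * per_ssd_read )) MB/sec"
echo "raw write ceiling  : $(( ssds * per_ssd_write / replicas )) MB/sec (each write lands on ${replicas} replicas)"
echo "client NIC ceiling : ${client_net} MB/sec"

If that reasoning holds, large sequential reads from a single client are
capped somewhere around the client NIC rather than the SSDs, while the 4k
results are dominated by per-op latency rather than bandwidth.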

Thanks for Ceph and keep up the awesome work!

Best regards, 
Patrik Martinsson  
Sweden











[Index of Archives]     [Information on CEPH]     [Linux Filesystem Development]     [Ceph Development]     [Ceph Large]     [Ceph Dev]     [Linux USB Development]     [Video for Linux]     [Linux Audio Users]     [Yosemite News]     [Linux Kernel]     [Linux SCSI]     [xfs]


  Powered by Linux