Re: New Ceph-cluster and performance "questions"

Patrik Martinsson <patrik.martinsson@xxxxxxxxxxxxx> · Thu, 8 Feb 2018 10:58:43 +0000

Hi Christian, 

First of all, thanks for all the great answers and sorry for the late
reply. 

On Tue, 2018-02-06 at 10:47 +0900, Christian Balzer wrote:
> Hello,
> 
> > I'm not a "storage-guy" so please excuse me if I'm missing /
> > overlooking something obvious. 
> > 
> > My question is in the area "what kind of performance am I to expect
> > with this setup". We have bought servers, disks and networking for
> > our
> > future ceph-cluster and are now in our "testing-phase" and I simply
> > want to understand if our numbers line up, or if we are missing
> > something obvious. 
> > 
> 
> A myriad of variables will make for a myriad of results, expected and
> otherwise.
> 
> For example, you say nothing about the Ceph version, how the OSDs are
> created (filestore, bluestore, details), OS and kernel (PTI!!)
> version.

Good catch, I totally forgot this. 

$ > ceph version 12.2.1-40.el7cp
(c6d85fd953226c9e8168c9abe81f499d66cc2716) luminous (stable), deployed
via Red Hat Ceph Storage 3 (ceph-ansible). Bluestore is enabled, and
osd_scenario is set to collocated.

$ > cat /etc/redhat-release 
Red Hat Enterprise Linux Server release 7.4 (Maipo)

$ > uname -r 
3.10.0-693.11.6.el7.x86_64 (PTI *not* disabled at boot)

> > Background, 
> > - cephmon1, DELL R730, 1 x E5-2643, 64 GB 
> > - cephosd1-6, DELL R730, 1 x E5-2697, 64 GB
> 
> Unless you're planning on having 16 SSDs per node, a CPU with fewer
> but faster cores would be better (see archives). 
> 
> In general, you will want to run atop or something similar on your
> ceph
> and client nodes during these tests to see where and if any resources
> (CPU, DISK, NET) are getting stressed.

Understood, thanks!

> > - each server is connected to a dedicated 50 Gbe network, with
> > Mellanox-4 Lx cards (teamed into one interface, team0).  
> > 
> > In our test we only have one monitor. This will of course not be
> > the
> > case later on. 
> > 
> > Each OSD, has the following SSD's configured as pass-through (not
> > raid
> > 0 through the raid-controller),
> > 
> > - 2 x Dell 1.6TB 2.5" SATA MLC MU 6Gbs SSD (THNSF81D60CSE), only
> > spec I
> > can find on Dell's homepage says "Data Transfer Rate 600 Mbps"
> > - 4 x Intel SSD DC S3700
> > (https://ark.intel.com/products/71916/Intel-SSD-DC-S3700-Series-800GB-2_5in-SATA-6Gbs-25nm-MLC)
> 
> When and where did you get those?
> I wonder if they're available again, had 0 luck getting any last
> year.

They're actually disks that we had "lying around"; no clue where you
could get them today. 

> > - 3 HDD's, which is uninteresting here. At the moment I'm only
> > interested in the performance of the SSD-pool.
> > 
> > Ceph-cluster is created with ceph-ansible with "default params"
> > (ie.
> > have not added / changed anything except the necessary). 
> > 
> > When ceph-cluster is up, we have 54 OSD's (36 SSD, 18HDD). 
> > The min_size is 3 on the pool. 
> 
> Any reason for that?
> It will make any OSD failure result in a cluster lockup with a size
> of 3.
> Unless you did set your size to 4, in which case you wrecked
> performance.

Hm, sorry, what I meant was size=3. Reading the documentation, I'm not
sure I understand the difference between size and min_size. 
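
As far as I can tell from the docs, size is the number of replicas Ceph
keeps of each object, while min_size is the minimum number of replicas
that must be available for a placement group to accept I/O. A minimal
sketch of that rule, where pg_state is a made-up helper for illustration
(not a Ceph command) and the model ignores recovery and backfill:

```shell
# Hypothetical helper: models only "replicas left >= min_size", nothing else.
# Usage: pg_state SIZE MIN_SIZE FAILED_REPLICAS
pg_state() {
  if [ $(( $1 - $3 )) -ge "$2" ]; then
    echo active
  else
    echo blocked
  fi
}

pg_state 3 3 1   # one replica lost with min_size=3
pg_state 3 2 1   # one replica lost with min_size=2

# The real values for the pool used in this thread can be read with:
#   ceph osd pool get ssd-bench size
#   ceph osd pool get ssd-bench min_size
```

With size=3 and min_size=3 a single lost replica already gives "blocked",
which would be the lockup described above; min_size=2 keeps the pool
usable in that case.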

> > Rules are created as follows, 
> > 
> > $ > ceph osd crush rule create-replicated ssd-rule default host ssd
> > $ > ceph osd crush rule create-replicated hdd-rule default host hdd
> > 
> > Testing is done on a separate node (same nic and network though), 
> > 
> > $ > ceph osd pool create ssd-bench 512 512 replicated ssd-rule
> > 
> > $ > ceph osd pool application enable ssd-bench rbd
> > 
> > $ > rbd create ssd-image --size 1T --pool ssd-bench
> > 
> > $ > rbd map ssd-image --pool ssd-bench
> > 
> > $ > mkfs.xfs /dev/rbd/ssd-bench/ssd-image
> > 
> > $ > mount /dev/rbd/ssd-bench/ssd-image /ssd-bench
> > 
> 
> Unless you're planning on using the Ceph cluster in this fashion
> (kernel
> mounted images), you'd be better off testing in an environment that
> matches the use case, i.e. from a VM.

Gotcha, thanks!

> > Fio is then run like this, 
> > $ > 
> > actions="read randread write randwrite"
> > blocksizes="4k 128k 8m"
> > tmp_dir="/tmp/"
> > 
> > for blocksize in ${blocksizes}; do
> >   for action in ${actions}; do
> >     rm -f ${tmp_dir}${action}_${blocksize}_${suffix}
> >     fio --directory=/ssd-bench \
> >         --time_based \
> >         --direct=1 \
> >         --rw=${action} \
> >         --bs=$blocksize \
> >         --size=1G \
> >         --numjobs=100 \
> >         --runtime=120 \
> >         --group_reporting \
> >         --name=testfile \
> >         --output=${tmp_dir}${action}_${blocksize}_${suffix}
> >   done
> > done
> > 
> > After running this, we end up with these numbers 
> > 
> > read_4k         iops : 159266     throughput : 622    MB / sec
> > randread_4k     iops : 151887     throughput : 593    MB / sec
> > 
> 
> These are very nice numbers. 
> Too nice, in my book.
> I have a test cluster with a cache-tier based on 2 nodes with 3 DC
> S3610s
> 400GB each, obviously with size 2 and min_size=1. So just based on
> that,
> it will be faster than a size 3 pool, Jewel with Filestore.
> Network is IPoIB (40Gb), so in that aspect similar to yours, 
> 64k MTU though.
> Ceph nodes have E5-2620 v3 @ 2.40GHz CPUs and 32GB RAM.
> I've run the following fio (with different rw actions of course) from
> a KVM/qemu VM and am also showing how busy the SSDs, OSD processes,
> qemu process on the comp node and the fio inside the VM are:
> "fio --size=4G --ioengine=libaio --invalidate=1 --direct=1 --numjobs=1
> --rw=read --name=fiojob --blocksize=4K --iodepth=64"
> 
> READ
>   read : io=4096.0MB, bw=81361KB/s, iops=20340, runt= 51552msec
> SSDs: 0% (pagecache on the nodes), OSDs: 45%, qemu: 330%, fio_in_VM:
> 19%
> 
> RANDREAD
>   read : io=4096.0MB, bw=62760KB/s, iops=15689, runt= 66831msec
> SSDs: 0% (pagecache on the nodes), OSDs: 50%, qemu: 550%!!,
> fio_in_VM: 23%
> 
> WRITE
>   write: io=4096.0MB, bw=256972KB/s, iops=64243, runt= 16322msec
> SSDs: 40%, OSDs: 20%, qemu: 150%, fio_in_VM: 45%
> 
> RANDWRITE
>   write: io=4096.0MB, bw=43981KB/s, iops=10995, runt= 95366msec
> SSDs: 38%, OSDs: 250%!!, qemu: 480%, fio_in_VM: 23%
> 
> Note especially the OSD CPU usage in the randwrite fio, this is where
> faster (and non-powersaving mode) CPUs will be significant. 
> I'm not seeing the same level of performance reductions with rand
> actions
> in your results.
> 
> We can roughly compare the reads as the SSDs and pool size play
> little to
> no part in it. 
> 20k *6 (to compensate for your OSD numbers) is 120k, definitely the
> same
> ball park as your 158k.
> It doesn't explain the 282k with your old setup, unless the MTU is
> really
> so significant (see below) or other things changed, like more 

Thanks for all that, it makes sense. I'm not sure I will dig much
deeper into why I got those numbers to begin with. It is a bit
annoying, but since we know so little about the disks and that
previous setup, it's impossible to compare (as you said at the
beginning, a "myriad of variables will make for a myriad of results").
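
A quick sanity check that works on any of the result lines, independent
of the setup: throughput should be about iops times block size. A small
sketch (iops_to_mbs is a name I made up; it uses integer shell arithmetic
and treats 1 MB as 1024 KB, which is an assumption about how these
figures were reported):

```shell
# Hypothetical helper: convert IOPS at a block size given in KB to MB/s.
# Usage: iops_to_mbs IOPS BLOCKSIZE_KB
iops_to_mbs() {
  echo $(( $1 * $2 / 1024 ))
}

iops_to_mbs 159266 4     # read_4k from the current setup
iops_to_mbs 50486 4      # write_4k from the current setup
iops_to_mbs 31705 128    # read_128k from the current setup
```

These come out at 622, 197 and 3963, which lines up with the reported
throughput columns, so at least the current numbers are internally
consistent.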

> For nonrand writes you're basically looking at latency (numjobs is
> meaningless), so that's why my 62k (remember size 2) are comparable to
> your 50k or 80k respectively. 
> For randwrite the larger amount of OSDs in your case nicely explains
> the
> difference seen.
> 
> > read_128k       iops : 31705      throughput : 3963.3 MB / sec
> > randread_128k   iops : 31664      throughput : 3958.5 MB / sec
> > 
> > read_8m         iops : 470        throughput : 3765.5 MB / sec
> > randread_8m     iops : 463        throughput : 3705.4 MB / sec
> > 
> > write_4k        iops : 50486      throughput : 197    MB / sec
> > randwrite_4k    iops : 42491      throughput : 165    MB / sec
> > 
> > write_128k      iops : 15907      throughput : 1988.5 MB / sec
> > randwrite_128k  iops : 15558      throughput : 1944.9 MB / sec
> > 
> > write_8m        iops : 347        throughput : 2781.2 MB / sec
> > randwrite_8m    iops : 347        throughput : 2777.2 MB / sec
> > 
> > 
> > Ok, if you read all the way here, the million dollar question is of
> > course whether the numbers above are in the ballpark of what to
> > expect, or if they are low. 
> > 
> > The main reason I'm a bit uncertain about the numbers above (and
> > this may sound fuzzy) is that we did a POC a couple of months ago
> > with fewer OSD's. If I remember the configuration correctly, that
> > is: unfortunately we only saved the numbers, not the *exact*
> > configuration *sigh* (networking is still the same though). Those
> > numbers were
> > 

Thanks again. 

> Which unfortunately basically means that these results are...
> questionable
> when comparing them with your current setup.
> 
> > read 4k         iops : 282303   throughput : 1102.8 MB / sec (b)
> > randread 4k     iops : 253453   throughput : 990.52 MB / sec (b)
> > 
> > read 128k       iops : 31298    throughput : 3912   MB / sec (w)
> > randread 128k   iops : 9013     throughput : 1126.8 MB / sec (w)
> > 
> > read 8m         iops : 405      throughput : 3241.4 MB / sec (w)
> > randread 8m     iops : 369      throughput : 2957.8 MB / sec (w)
> > 
> > write 4k        iops : 80644    throughput : 315    MB / sec (b)
> > randwrite 4k    iops : 53178    throughput : 207    MB / sec (b)
> > 
> > write 128k      iops : 17126    throughput : 2140.8 MB / sec (b)
> > randwrite 128k  iops : 11654    throughput : 2015.9 MB / sec (b)
> > 
> > write 8m        iops : 258      throughput : 2067.1 MB / sec (w)
> > randwrite 8m    iops : 251      throughput : 1456.9 MB / sec (w)
> > 
> > Where (b) marks a number that was higher in the POC and (w) one
> > that was lower. What I expected from adding more OSD's was an
> > increase in *all* numbers. The read_4k throughput and iops numbers
> > in the current setup are not even close to the POC, which makes me
> > wonder if these "new" numbers are what they are "supposed to" be
> > or if I'm missing something obvious. 
> > 
> > Ehm, in this new setup we are running with MTU 1500. I think the
> > POC was set to 9000, but the difference on read_4k is roughly 400
> > MB/sec and I wonder if the MTU alone can account for that. 
> > 
> 
> You're in the best position of everybody here to verify this by
> changing
> your test cluster to use the other MTU and compare...

Yes, we will do some more benchmarks and monitor the results.
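
For the MTU comparison, one way to confirm that jumbo frames really pass
end-to-end before re-running fio is a ping with fragmentation prohibited.
The largest ICMP payload that fits is the MTU minus 28 bytes (20 for the
IPv4 header, 8 for the ICMP header). A sketch, where max_ping_payload is
a made-up helper and cephosd1 stands in for whichever node is being
tested:

```shell
# Hypothetical helper: largest ICMP payload that fits in one IPv4 packet.
# Usage: max_ping_payload MTU
max_ping_payload() {
  echo $(( $1 - 28 ))
}

max_ping_payload 1500
max_ping_payload 9000

# With the result, something along these lines should succeed only if every
# hop (NICs, team0, switch ports) accepts the larger frames:
#   ip link show team0                       # shows the current mtu
#   ping -M do -s "$(max_ping_payload 9000)" -c 3 cephosd1
```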

> > Is the above a good way of measuring our cluster, or are there
> > better, more reliable ways of measuring it? 
> > 
> 
> See above.
> A fio test is definitely a closer thing to reality compared to OSD or
> RADOS benches.
> 
> > Is there a way to calculate this "theoretically" (i.e. with 6 nodes
> > and 36 SSD's we should get these numbers) and then compare it to
> > reality? Again, I'm not a storage guy and haven't really done this
> > before, so please excuse my layman's terms. 
> > 
> 
> People have tried in the past and AFAIR nothing really conclusive
> came
> about, it really is a game of too many variables. 
> 
> Regards,
> 
> Christian

Again, thanks for everything, nicely explained. 

Not sure if it could be of interest to anyone, but I took a screenshot
of our fio diagrams generated in Confluence and put it here:
https://imgur.com/a/PaMLg 

Basically the only interesting bars are the yellow (test #3) and the
purple (test #4), as those are the ones where I actually know the exact
configuration. 

Interesting to see that enabling the raid controller and putting all
disks in raid 0 (disk cache disabled) vs. pass through yielded quite a
big improvement in the write128k/write8m and randwrite128k/randwrite8m
areas, whereas in the other areas there isn't much of a difference. So
in my opinion, raid 0 would be the way to go. However, I see some
different opinions about this. Red Hat talks about it in the following
pdf:
https://www.redhat.com/cms/managed-files/st-rhcs-config-guide-technology-detail-inc0387897-201604-en.pdf

Any thoughts about this ? 

Putting disks through the raid-controller also messes up the automatic
type-classification that ceph does, which is annoying - but
"workaroundable". As I understand it, ceph determines the disk class by
looking at the value in /sys/block/<disk>/queue/rotational (1 is hdd,
0 is ssd). This value gets set correctly when using "pass through
(non-raid)", but when using raid 0 it gets set to 1 even though the
disks are ssd's. 

We work around this by using the following udev-rule, where "sd[a-c]"
would be the ssd-disks.

$ > echo 'ACTION=="add|change", KERNEL=="sd[a-c]", ATTR{queue/rotational}="0"' >> /etc/udev/rules.d/10-ssd-persistent.rules
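
To check whether the rule took effect, the flag can be read back from
sysfs. A small sketch, where disk_class is a made-up helper that maps the
flag the same way as described above (1 is hdd, 0 is ssd); the optional
second argument only exists so it can be pointed at a different directory
tree:

```shell
# Hypothetical helper: report hdd/ssd from the kernel's rotational flag.
# Usage: disk_class DISK [SYSFS_ROOT]   (SYSFS_ROOT defaults to /sys/block)
disk_class() {
  flag_file="${2:-/sys/block}/$1/queue/rotational"
  if [ ! -r "$flag_file" ]; then
    echo unknown
  elif [ "$(cat "$flag_file")" = "0" ]; then
    echo ssd
  else
    echo hdd
  fi
}

for disk in sda sdb sdc; do
  echo "$disk: $(disk_class "$disk")"
done

# After editing the rules file, udev has to re-read and re-apply it:
#   udevadm control --reload-rules
#   udevadm trigger
```

For OSDs that were already created with the wrong class, I believe the
class can also be changed by hand with "ceph osd crush rm-device-class"
followed by "ceph osd crush set-device-class", but I have not tried that
on this cluster.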

Maybe this has been mentioned before, but I'm curious why this happens.
Does anyone know? 

Again, thanks for all the great work.

Best regards,
Patrik 
Sweden

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com