Hello,

On Thu, 8 Feb 2018 10:58:43 +0000 Patrik Martinsson wrote:

> Hi Christian,
> 
> First of all, thanks for all the great answers and sorry for the late
> reply.
> 
You're welcome.

> On Tue, 2018-02-06 at 10:47 +0900, Christian Balzer wrote:
> > Hello,
> > 
> > > I'm not a "storage-guy" so please excuse me if I'm missing /
> > > overlooking something obvious.
> > > 
> > > My question is in the area "what kind of performance am I to
> > > expect with this setup". We have bought servers, disks and
> > > networking for our future ceph-cluster and are now in our
> > > "testing-phase" and I simply want to understand if our numbers
> > > line up, or if we are missing something obvious.
> > 
> > A myriad of variables will make for a myriad of results, expected
> > and otherwise.
> > 
> > For example, you say nothing about the Ceph version, how the OSDs
> > are created (filestore, bluestore, details), OS and kernel (PTI!!)
> > version.
> 
> Good catch, I totally forgot this.
> 
> $ > ceph version
> 12.2.1-40.el7cp (c6d85fd953226c9e8168c9abe81f499d66cc2716) luminous
> (stable), deployed via Red Hat Ceph Storage 3 (ceph-ansible).
> Bluestore is enabled, and osd_scenario is set to collocated.
> 
Given the (rather disconcerting) number of bugs in Luminous, you
probably want to go to 12.2.2 now and .3 when released.

> $ > cat /etc/redhat-release
> Red Hat Enterprise Linux Server release 7.4 (Maipo)
> 
> $ > uname -r
> 3.10.0-693.11.6.el7.x86_64 (PTI *not* disabled at boot)
> 
That's what I'd call an old kernel, if it weren't for the (insane
level of) RH backporting.
As for PTI, I'd disable it on pure Ceph nodes, the logic being that if
somebody can access those in the first place you have bigger problems
already.
Make sure to run a test/benchmark before and after and let the
community here know.
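A minimal sketch of that before/after check, assuming a RHEL 7.4+
kernel where the Meltdown sysfs reporting has been backported and PTI
is toggled with the "nopti" boot flag (verify both against your
kernel's documentation before relying on them):

```shell
# Report whether page-table isolation is currently active. The sysfs
# file only exists on kernels with the Meltdown reporting backported.
if [ -r /sys/devices/system/cpu/vulnerabilities/meltdown ]; then
    cat /sys/devices/system/cpu/vulnerabilities/meltdown
else
    echo "no sysfs report; check 'dmesg | grep isolation' instead"
fi

# Disable PTI on the next boot (RHEL 7 style, via grubby), then reboot
# and re-run the *identical* fio job so the numbers are comparable:
# grubby --update-kernel=ALL --args="nopti"
```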
> > 
> > > Background,
> > > - cephmon1, DELL R730, 1 x E5-2643, 64 GB
> > > - cephosd1-6, DELL R730, 1 x E5-2697, 64 GB
> > 
> > Unless you're planning on having 16 SSDs per node, a CPU with less
> > and faster cores would be better (see archives).
> > 
> > In general, you will want to run atop or something similar on your
> > ceph and client nodes during these tests to see where and if any
> > resources (CPU, DISK, NET) are getting stressed.
> 
> Understood, thanks!
> 
> > > - each server is connected to a dedicated 50 GbE network, with
> > > Mellanox-4 Lx cards (teamed into one interface, team0).
> > > 
> > > In our test we only have one monitor. This will of course not be
> > > the case later on.
> > > 
> > > Each OSD has the following SSDs configured as pass-through (not
> > > raid 0 through the raid-controller),
> > > 
> > > - 2 x Dell 1.6TB 2.5" SATA MLC MU 6Gbs SSD (THNSF81D60CSE), only
> > > spec I can find on Dell's homepage says "Data Transfer Rate 600
> > > Mbps"
> > > - 4 x Intel SSD DC S3700
> > > (https://ark.intel.com/products/71916/Intel-SSD-DC-S3700-Series-800GB-2_5in-SATA-6Gbs-25nm-MLC)
> > 
> > When and where did you get those?
> > I wonder if they're available again, had 0 luck getting any last
> > year.
> 
> It's actually disks that we have had "lying around", no clue where
> you could get them today.
> 
Consider yourself lucky.

> > > - 3 HDDs, which are uninteresting here. At the moment I'm only
> > > interested in the performance of the SSD-pool.
> > > 
> > > Ceph-cluster is created with ceph-ansible with "default params"
> > > (ie. have not added / changed anything except the necessary).
> > > 
> > > When the ceph-cluster is up, we have 54 OSDs (36 SSD, 18 HDD).
> > > The min_size is 3 on the pool.
> > 
> > Any reason for that?
> > It will make any OSD failure result in a cluster lockup with a size
> > of 3.
> > Unless you did set your size to 4, in which case you wrecked
> > performance.
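To spell that out with a sketch (pool name taken from your later
commands; try it on a test pool first): size is the number of replicas
Ceph keeps, min_size the number that must be up before client I/O is
served, so the usual 3/2 keeps serving through a single OSD failure,
while min_size=3 on a size=3 pool blocks on any one failure:

```shell
# size = replicas kept; min_size = replicas that must be online for
# I/O to proceed. 3/2 (the defaults) tolerates one failure without
# blocking client I/O.
ceph osd pool set ssd-bench size 3
ceph osd pool set ssd-bench min_size 2
ceph osd pool get ssd-bench min_size    # verify the change took
```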
> 
> Hm, sorry, what I meant was size=3. Reading the documentation, I'm
> not sure I understand the difference between size and min_size.
> 
Check the archives for this, lots of pertinent and moderately recent
discussions about this.
3 and 2 (defaults) are fine for most people.

> > > Rules are created as follows,
> > > 
> > > $ > ceph osd crush rule create-replicated ssd-rule default host ssd
> > > $ > ceph osd crush rule create-replicated hdd-rule default host hdd
> > > 
> > > Testing is done on a separate node (same NIC and network though),
> > > 
> > > $ > ceph osd pool create ssd-bench 512 512 replicated ssd-rule
> > > 
> > > $ > ceph osd pool application enable ssd-bench rbd
> > > 
> > > $ > rbd create ssd-image --size 1T --pool ssd-bench
> > > 
> > > $ > rbd map ssd-image --pool ssd-bench
> > > 
> > > $ > mkfs.xfs /dev/rbd/ssd-bench/ssd-image
> > > 
> > > $ > mount /dev/rbd/ssd-bench/ssd-image /ssd-bench
> > 
> > Unless you're planning on using the Ceph cluster in this fashion
> > (kernel mounted images), you'd be better off testing in an
> > environment that matches the use case, i.e. from a VM.
> 
> Gotcha, thanks!
> 
> > > Fio is then run like this,
> > > $ >
> > > actions="read randread write randwrite"
> > > blocksizes="4k 128k 8m"
> > > tmp_dir="/tmp/"
> > > 
> > > for blocksize in ${blocksizes}; do
> > >   for action in ${actions}; do
> > >     rm -f ${tmp_dir}${action}_${blocksize}_${suffix}
> > >     fio --directory=/ssd-bench \
> > >         --time_based \
> > >         --direct=1 \
> > >         --rw=${action} \
> > >         --bs=$blocksize \
> > >         --size=1G \
> > >         --numjobs=100 \
> > >         --runtime=120 \
> > >         --group_reporting \
> > >         --name=testfile \
> > >         --output=${tmp_dir}${action}_${blocksize}_${suffix}
> > >   done
> > > done
> > > 
> > > After running this, we end up with these numbers
> > > 
> > > read_4k     iops : 159266  throughput : 622 MB/sec
> > > randread_4k iops : 151887  throughput : 593 MB/sec
> > 
> > These are very nice numbers.
> > Too nice, in my book.
> > I have a test cluster with a cache-tier based on 2 nodes with 3 DC
> > S3610s 400GB each, obviously with size 2 and min_size=1. So just
> > based on that, it will be faster than a size 3 pool, Jewel with
> > Filestore.
> > Network is IPoIB (40Gb), so in that aspect similar to yours, 64k
> > MTU though.
> > Ceph nodes have E5-2620 v3 @ 2.40GHz CPUs and 32GB RAM.
> > I've run the following fio (with different rw actions of course)
> > from a KVM/qemu VM and am also showing how busy the SSDs, OSD
> > processes, qemu process on the comp node and the fio inside the VM
> > are:
> > "fio --size=4G --ioengine=libaio --invalidate=1 --direct=1
> > --numjobs=1 --rw=read --name=fiojob --blocksize=4K --iodepth=64"
> > 
> > READ
> > read : io=4096.0MB, bw=81361KB/s, iops=20340, runt= 51552msec
> > SSDs: 0% (pagecache on the nodes), OSDs: 45%, qemu: 330%,
> > fio_in_VM: 19%
> > 
> > RANDREAD
> > read : io=4096.0MB, bw=62760KB/s, iops=15689, runt= 66831msec
> > SSDs: 0% (pagecache on the nodes), OSDs: 50%, qemu: 550%!!,
> > fio_in_VM: 23%
> > 
> > WRITE
> > write: io=4096.0MB, bw=256972KB/s, iops=64243, runt= 16322msec
> > SSDs: 40%, OSDs: 20%, qemu: 150%, fio_in_VM: 45%
> > 
> > RANDWRITE
> > write: io=4096.0MB, bw=43981KB/s, iops=10995, runt= 95366msec
> > SSDs: 38%, OSDs: 250%!!, qemu: 480%, fio_in_VM: 23%
> > 
> > Note especially the OSD CPU usage in the randwrite fio, this is
> > where faster (and non-powersaving mode) CPUs will be significant.
> > I'm not seeing the same level of performance reductions with rand
> > actions in your results.
> > 
> > We can roughly compare the reads as the SSDs and pool size play
> > little to no part in it.
> > 20k * 6 (to compensate for your OSD numbers) is 120k, definitely
> > the same ball park as your 158k.
> > It doesn't explain the 282k with your old setup, unless the MTU is
> > really so significant (see below) or other things changed, like more
> 
> Thanks for all that - makes sense.
> I'm not sure I will dig so much deeper into why I got those numbers
> to begin with - it is a bit annoying though, but since we have little
> knowledge about the disks and that previous setup, it's impossible to
> compare (since as you say, in the beginning "myriad of variables will
> make for a myriad of results").
> 
> [snip]
> 
> > > Ehm, in this new setup we are running with MTU 1500, I think we
> > > had the POC at 9000, but the difference on the read_4k is roughly
> > > 400 MB/sec and I wonder if the MTU will make up for that.
> > 
> > You're in the best position of everybody here to verify this by
> > changing your test cluster to use the other MTU and compare...
> 
> Yes, we will do some more benchmarks and monitor the results.
> 
> > > Is the above a good way of measuring our cluster, or are there
> > > better, more reliable ways of measuring it?
> > 
> > See above.
> > A fio test is definitely a closer thing to reality compared to OSD
> > or RADOS benches.
> > 
> > > Is there a way to calculate this "theoretically" (ie. with 6
> > > nodes and 36 SSDs we should get these numbers) and then compare
> > > it to reality. Again, not a storage guy and haven't really done
> > > this before so please excuse me for my layman's terms.
> > 
> > People have tried in the past and AFAIR nothing really conclusive
> > came about, it really is a game of too many variables.
> 
> Again, thanks for everything, nicely explained.
> 
> Not sure if it could be of interest for anyone, but I took a
> screenshot of our fio-diagrams generated in confluence and put it
> here, https://imgur.com/a/PaMLg
> 
> Basically the only interesting bars are the yellow (test #3) and the
> purple (test #4), as those are the ones where I actually know the
> exact configuration.
> 
> Interesting to see that enabling the raid controller and putting all
> disks in raid 0 (disk cache disabled) vs.
> pass through yielded a quite big improvement in the
> write128k/write8m and randwrite128k/randwrite8m areas. Whereas in the
> other areas, there isn't that much of a difference. So in my opinion,
> raid 0 would be the way to go - however I see some differing opinions
> about this. Red Hat talks about this in the following pdf,
> https://www.redhat.com/cms/managed-files/st-rhcs-config-guide-technology-detail-inc0387897-201604-en.pdf
> 
> Any thoughts about this?
> 
If the controller cache is sizable it will of course help, and if
you're willing to work with the drawbacks you already discovered
(SMART is often also a PITA going through the controller) then it is
indeed preferable.
Always keep in mind that this is masking the real device speeds
though, meaning that once the cache is overwhelmed it is back to the
"slow" speeds.
Also, a failing battery backup unit will disable the cache, leaving
you wondering why your machine suddenly got so slow.

Regards,

Christian

> Putting disks through the raid-controller also messes up the
> automatic type-classification that ceph does, which is annoying -
> but "workaroundable". As I understand it, ceph determines the disk
> class by looking at the value in /sys/block/<disk>/queue/rotational
> (1 is hdd, 0 is ssd). This value gets set correctly when using "pass
> through (non-raid)", but when using raid 0, this gets set to 1 even
> though they're SSDs.
> 
> We work around this by using the following udev-rule, where
> "sd[a-c]" would be the ssd-disks.
> 
> $ > echo 'ACTION=="add|change", KERNEL=="sd[a-c]", ATTR{queue/rotational}="0"' >> /etc/udev/rules.d/10-ssd-persistent.rules
> 
> Maybe this has been mentioned, but I'm curious why this happens,
> anyone knows?
> 
> Again, thanks for all the great work.
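On why it happens: a raid 0 volume is a logical device presented by
the controller, which typically doesn't pass the underlying drive's
identify data through, so the kernel falls back to rotational=1. As an
aside, instead of patching sysfs via udev, Luminous lets you override
the class directly in CRUSH; a sketch, with osd.0 standing in for one
of the misclassified OSDs:

```shell
# The device class is assigned once at OSD creation from the
# rotational flag, and has to be removed before it can be changed.
ceph osd crush rm-device-class osd.0
ceph osd crush set-device-class ssd osd.0
ceph osd crush class ls    # "ssd" should now be listed
```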
> 
> Best regards,
> Patrik
> Sweden
> 

-- 
Christian Balzer        Network/Systems Engineer
chibi@xxxxxxx           Rakuten Communications
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com