Re: adding cache tier in productive hammer environment

Hi Oliver,

 

 

Have you tried tuning some of the cluster settings to fix the IO errors in the VMs?

We found some of the same issues when reweighting, backfilling and removing large snapshots. By minimizing the number of concurrent backfills and prioritizing client IO we can now add/remove OSDs without the VMs throwing those nasty IO errors.

We have been running a 3 node cluster for about a year now on Hammer with 45 2TB SATA OSDs and no SSDs. It's backing KVM hosts and RBD images.

Here are the things we changed:

ceph tell osd.* injectargs '--osd-max-backfills 1'
ceph tell osd.* injectargs '--osd-recovery-threads 1'
ceph tell osd.* injectargs '--osd-recovery-op-priority 1'
ceph tell osd.* injectargs '--osd-client-op-priority 63'
ceph tell osd.* injectargs '--osd-recovery-max-active 1'
ceph tell osd.* injectargs '--osd-snap-trim-sleep 0.1'

Recovery may take a little longer while backfilling, but the cluster is still responsive and we have happy VMs now.
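
If you want these to survive OSD restarts, the same values can also go
into ceph.conf; roughly like this (just a sketch, adjust to your own layout):

[osd]
osd_max_backfills = 1
osd_recovery_threads = 1
osd_recovery_op_priority = 1
osd_client_op_priority = 63
osd_recovery_max_active = 1
osd_snap_trim_sleep = 0.1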

I've collected these from various posts from the ceph-users list.

Maybe they will help you if you haven't tried them already.

 

Chris

 

On 2016-04-07 4:18 am, Oliver Dzombic wrote:

Hi Christian,

thank you for answering, I appreciate your time!

---

It's used for RBD-hosted VMs and also CephFS-hosted VMs.

The basic problem is/was that single OSDs simply go out/down, ending in
SATA bus errors inside the VMs, which then have to be rebooted, if they
can be at all, because as long as OSDs are missing in that scenario the
customers cannot start their VMs.

Installing/checking Munin revealed very high drive utilization, in other
words simply an overload of the cluster.

The initial setup was 4 nodes with 4x mon, each node with 3x 6 TB HDD and
1x SSD for journal.

So I started to add more OSDs (2 nodes, each with 3x 6 TB HDD and 1x SSD
for journal) and, as first aid, reduced the replication from 3 to 2 to
lower the (write) load on the cluster.

I planned to wait until the new LTS is out, but I have now already added
another node with 10x 3 TB HDD, 2x SSD for journal and 2-3x SSD for the
cache tier (changing strategy and increasing the number of drives while
reducing their size; that was a design mistake on my part).

 osdmap e31602: 28 osds: 28 up, 28 in
            flags noscrub,nodeep-scrub
      pgmap v13849513: 1428 pgs, 5 pools, 19418 GB data, 4932 kobjects
            39270 GB used, 88290 GB / 124 TB avail
                1428 active+clean



The range goes from 200 op/s to around 5000 op/s.

The current average drive utilization is 20-30%.

If we have backfill (OSD out/down) or a reweight, the utilization of the
HDDs goes straight to 90-100%.

Munin shows on all drives (except the SSDs) an average disk latency of
170 ms, with a minimum of 80-130 ms and a maximum of 300-600 ms.

Currently the 4 initial nodes are in datacenter A, and the 3 other nodes
are in datacenter B together with most of the VMs.

I am currently draining the 4 initial nodes by using ceph osd reweight to
reduce their usage little by little, with the goal of removing those OSDs
completely and keeping only the monitors there.
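
Roughly like this, one OSD at a time and waiting for HEALTH_OK in between
(osd.12 is only an example):

ceph osd reweight 12 0.9        # then 0.8, 0.7, ... step by step
ceph osd out 12                 # once backfill from the last step is done
# stop the OSD daemon on its host, then remove it completely:
ceph osd crush remove osd.12
ceph auth del osd.12
ceph osd rm 12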

The complete cluster has to move into a single datacenter together with
all the VMs.

---

I am reducing the number of nodes because, from an administrative point of
view, many small nodes are not very handy. I prefer extending the hardware
power per node in terms of CPU, RAM and HDDs.

So the final cluster will look like this:

3x OSD Nodes, each:

2x E5-2620v3 CPU, 128 GB RAM, 2x 10 Gbit network, Adaptec HBA 1000-16e
to connect to external JBOD servers holding the cold storage HDDs.
Maybe ~24 drives of 2 or 3 TB, SAS or SATA at 7200 RPM.

I think SAS is very useful in a Ceph environment because of the reduced
access times (4-5 ms vs. 10 ms). But then again, maybe with a cache tier
the difference is not really that big.

That, together with Samsung SM863 240 GB SSDs for journal and cache tier,
connected to the board directly or to a separate Adaptec HBA 1000-16i.

So much for the current idea/theory/plan.

---

But it is a long road to that point. Last night I did a reweight of
3 OSDs from 1.0 to 0.9, which ended with one HDD going down/out, so I
had to restart the OSD (again with IO errors in some of the VMs).

So based on your article, the cache tier solved your problem, and I
think I have basically the same one.

---

So a very good hint is to activate the whole cache tier at night,
when things are a bit smoother.

Any suggestions / criticism / advice is highly welcome :-)

Thank you!


-- 
Mit freundlichen Gruessen / Best regards

Oliver Dzombic
IP-Interactive

mailto:info@xxxxxxxxxxxxxxxxx

Anschrift:

IP Interactive UG ( haftungsbeschraenkt )
Zum Sonnenberg 1-3
63571 Gelnhausen

HRB 93402 beim Amtsgericht Hanau
Geschäftsführung: Oliver Dzombic

Steuer Nr.: 35 236 3622 1
UST ID: DE274086107


On 07.04.2016 at 05:32, Christian Balzer wrote:

Hello,

On Wed, 6 Apr 2016 20:35:20 +0200 Oliver Dzombic wrote:

Hi,

I have some IO issues, and after Christian's great article/hint about
caches I plan to add caches too.

Thanks, version 2 is still a work in progress, as I keep running into
unknowns.

IO issues in what sense, like in too many write IOPS for the current HW to
sustain?
Also, what are you using Ceph for, RBD hosting VM images?

It will help you a lot if you can identify and quantify the usage patterns
(including a rough idea on how many hot objects you have) and where you
run into limits.
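
A few quick ways to get rough numbers for that (nothing fancy, just the
standard tools):

ceph -s                  # overall client op/s and throughput right now
ceph osd pool stats      # the same, broken down per pool
rados df                 # object counts and usage per pool
iostat -x 5              # per-disk utilization/await on the OSD nodes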

So now comes the troublesome question:

How dangerous is it to add cache tiers to an existing cluster with
around 30 OSDs and 40 TB of data on 3-6 (currently being reduced) nodes?

You're reducing nodes? Why?
More nodes/OSDs equates to more IOPS in general.

40 TB is a sizable amount of data; how many objects does your cluster hold?
Also is that raw data or after replication (size 3?)?
In short, "ceph -s" output please. ^.^

I mean, will everything just explode and I just die, or what is the
roadmap for introducing this once you already have a running cluster?

That's pretty much straightforward from the Ceph docs at:
http://docs.ceph.com/docs/master/rados/operations/cache-tiering/
(replace master with hammer if you're running that)

Nothing happens until the "set-overlay" bit and you will want to configure
all the pertinent bits before that.
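
Condensed, the sequence from those docs looks roughly like this; pool
names and numbers are only examples, size target_max_bytes to your
actual SSDs:

ceph osd pool create cache-pool 512 512
ceph osd tier add rbd cache-pool
ceph osd tier cache-mode cache-pool writeback
ceph osd pool set cache-pool hit_set_type bloom
ceph osd pool set cache-pool hit_set_count 1
ceph osd pool set cache-pool hit_set_period 3600
ceph osd pool set cache-pool target_max_bytes 200000000000
ceph osd pool set cache-pool cache_target_dirty_ratio 0.4
ceph osd pool set cache-pool cache_target_full_ratio 0.8
# nothing touches the client IO path until this last step:
ceph osd tier set-overlay rbd cache-pool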

A basic question is whether you will have dedicated SSD cache tier hosts
or put the SSDs holding the cache pool into your current hosts.
Dedicated hosts have the advantage of matched HW (enough CPU power for
the SSDs) and simpler configuration; shared hosts can have the advantage
of spreading the network load further out instead of having everything
go through the cache tier nodes.
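
If the SSDs go into the existing hosts, the cache pool also needs its own
CRUSH root and rule so it only lands on the SSD OSDs; roughly like this
(bucket and rule names are only examples):

ceph osd crush add-bucket ssd-root root
ceph osd crush add-bucket node1-ssd host
ceph osd crush move node1-ssd root=ssd-root
ceph osd crush set osd.24 1.0 host=node1-ssd    # repeat per SSD OSD
ceph osd crush rule create-simple ssd-rule ssd-root host
ceph osd pool set cache-pool crush_ruleset 1    # rule id from "ceph osd crush rule dump"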

The size and length of the explosion will entirely depend on:
1) how capable your current cluster is, how (over)loaded it is.
2) the actual load/usage at the time you phase the cache tier in
3) the amount of "truly hot" objects you have.

As I wrote here:
http://lists.ceph.com/pipermail/ceph-users-ceph.com/2016-February/007933.html

In my case, with a BADLY overloaded base pool and a constant stream of
log/status writes (4-5 MB/s, 1000 IOPS) from 200 VMs, it all stabilized
after 10 minutes.

Truly hot objects as mentioned above will be those (in the case of VM
images) holding active directory inodes and files.


Anything that needs to be considered? Dangerous no-nos?

Also, it will happen that I have to add the cache tier servers one by one,
not all at the same time.

You want at least 2 cache tier servers from the start, with well-known,
well-tested (LSI timeouts!) SSDs in them.

Christian

I am happy for any kind of advice.

Thank you !



_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
