Hi Chris,

thank you very much for your advice! Currently we already have the
following running:

osd_op_threads = 8
osd_max_backfills = 1
osd_recovery_max_active = 1

I will add your suggestions! For sure there is a lot of room for tweaking
the config, which is currently very basic. A rough sketch of what the
combined settings could look like, and of the cache tier setup I have in
mind, follows below.
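To make the changes stick across OSD restarts, my plan is to inject them at
runtime the way you showed and to also put them into ceph.conf. This is only
a rough, untested sketch combining our current values with the ones from
your mail (I left out osd-max-recovery-threads until I have double-checked
the exact option name on Hammer):

[osd]
    osd op threads           = 8
    osd max backfills        = 1
    osd recovery max active  = 1
    osd recovery op priority = 1
    osd client op priority   = 63
    osd snap trim sleep      = 0.1

As far as I understand, injectargs only changes the values of the running
daemons, while the ceph.conf entries take effect the next time an OSD is
(re)started, so doing both should cover it.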
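For the cache tier itself (see Christian's pointer to the docs further down
in the thread), I expect the sequence to look roughly like the following.
The pool names and the sizing values are just placeholders; none of this
has been tested on our cluster yet:

ceph osd tier add rbd rbd-cache
ceph osd tier cache-mode rbd-cache writeback
ceph osd pool set rbd-cache hit_set_type bloom
ceph osd pool set rbd-cache hit_set_count 1
ceph osd pool set rbd-cache hit_set_period 3600
ceph osd pool set rbd-cache target_max_bytes 500000000000
ceph osd pool set rbd-cache cache_target_dirty_ratio 0.4
ceph osd pool set rbd-cache cache_target_full_ratio 0.8
ceph osd tier set-overlay rbd rbd-cache

Following Christian's advice, the set-overlay step comes last, since nothing
changes for the clients until then, and that is the moment when the cache
pool actually starts taking traffic.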
--
Mit freundlichen Gruessen / Best regards

Oliver Dzombic
IP-Interactive

mailto:info@xxxxxxxxxxxxxxxxx

Anschrift:

IP Interactive UG ( haftungsbeschraenkt )
Zum Sonnenberg 1-3
63571 Gelnhausen

HRB 93402 beim Amtsgericht Hanau
Geschäftsführung: Oliver Dzombic

Steuer Nr.: 35 236 3622 1
UST ID: DE274086107

On 07.04.2016 at 20:32, Chris Taylor wrote:
> Hi Oliver,
>
> Have you tried tuning some of the cluster settings to fix the IO errors
> in the VMs?
>
> We found some of the same issues when reweighting, backfilling and
> removing large snapshots. By minimizing the number of concurrent
> backfills and prioritizing client IO we can now add/remove OSDs without
> the VMs throwing those nasty IO errors.
>
> We have been running a 3 node cluster for about a year now on Hammer
> with 45 2TB SATA OSDs and no SSDs. It's backing KVM hosts and RBD images.
>
> Here are the things we changed:
>
> ceph tell osd.* injectargs '--osd-max-backfills 1'
> ceph tell osd.* injectargs '--osd-max-recovery-threads 1'
> ceph tell osd.* injectargs '--osd-recovery-op-priority 1'
> ceph tell osd.* injectargs '--osd-client-op-priority 63'
> ceph tell osd.* injectargs '--osd-recovery-max-active 1'
> ceph tell osd.* injectargs '--osd-snap-trim-sleep 0.1'
>
> Recovery may take a little longer while backfilling, but the cluster is
> still responsive and we have happy VMs now.
>
> I've collected these from various posts on the ceph-users list.
>
> Maybe they will help you if you haven't tried them already.
>
> Chris
>
> On 2016-04-07 4:18 am, Oliver Dzombic wrote:
>
>> Hi Christian,
>>
>> thank you for answering, I appreciate your time!
>>
>> ---
>>
>> It's used for RBD-hosted VMs and also CephFS-hosted VMs.
>>
>> Well, the basic problem is/was that single OSDs simply go out/down,
>> ending in SATA bus errors for the VMs, which then have to be rebooted,
>> if they can be at all, because as long as OSDs are missing in that
>> scenario the customers can't start their VMs.
>>
>> Installing/checking Munin revealed very high drive utilization, and
>> with that simply an overloaded cluster.
>>
>> The initial setup was 4 nodes with 4x mon, each with 3x 6 TB HDD and
>> 1x SSD for journal.
>>
>> So I started to add more OSDs ( 2 nodes, each with 3x 6 TB HDD and 1x
>> SSD for journal ) and, as first aid, reduced the replication from 3 to
>> 2 to lower the (write) load of the cluster.
>>
>> I planned to wait until the new LTS is out, but I have already added
>> another node with 10x 3 TB HDD, 2x SSD for journal and 2-3x SSD for
>> cache tier ( changing strategy and increasing the number of drives
>> while reducing their size - the original sizing was a design mistake
>> on my part ).
>>
>>      osdmap e31602: 28 osds: 28 up, 28 in
>>             flags noscrub,nodeep-scrub
>>       pgmap v13849513: 1428 pgs, 5 pools, 19418 GB data, 4932 kobjects
>>             39270 GB used, 88290 GB / 124 TB avail
>>                 1428 active+clean
>>
>> The load ranges from 200 op/s to around 5000 op/s.
>>
>> The current average drive utilization is 20-30%.
>>
>> If we have backfill ( OSD out/down ) or reweight, the utilization of
>> the HDDs goes straight to 90-100%.
>>
>> Munin shows on all drives ( except the SSDs ) a disk latency of
>> 170 ms on average, with a minimum of 80-130 ms and a maximum of
>> 300-600 ms.
>>
>> Currently, the 4 initial nodes are in datacenter A and the 3 other
>> nodes are, together with most of the VMs, in datacenter B.
>>
>> I am currently draining the 4 initial nodes by using ceph osd reweight
>> to lower their usage little by little, so that the OSDs can be removed
>> from there completely and only the monitors stay up.
>>
>> The complete cluster has to move to one single datacenter together
>> with all VMs.
>>
>> ---
>>
>> I am reducing the number of nodes because, from an administrative
>> point of view, it is not very handy. I prefer extending the hardware
>> power in terms of CPU, RAM and HDD.
>>
>> So the final cluster will look like this:
>>
>> 3x OSD nodes, each:
>>
>> 2x E5-2620v3 CPU, 128 GB RAM, 2x 10 Gbit network, Adaptec HBA 1000-16e
>> to connect to external JBOD servers holding the cold-storage HDDs.
>> Maybe ~24 drives of 2 or 3 TB, SAS or SATA, 7200 RPM.
>>
>> I think SAS is, because of the reduced access times ( 4-5 ms vs.
>> 10 ms ), very useful in a Ceph environment. But then again, maybe with
>> a cache tier the impact/difference is not really that big.
>>
>> That together with Samsung SM863 240 GB SSDs for journal and cache
>> tier, connected to the board directly or to a separate Adaptec HBA
>> 1000-16i.
>>
>> So far the current idea/theory/plan.
>>
>> ---
>>
>> But it is a long road to that point. Last night I was reweighting
>> 3 OSDs from 1.0 to 0.9, which ended with one HDD going down/out, so I
>> had to restart the OSD ( with, again, IO errors in some of the VMs ).
>>
>> So based on your article, the cache tier solved your problem, and I
>> think I have basically the same one.
>>
>> ---
>>
>> So a very good hint is to activate the whole cache tier at night, when
>> things are a bit more smooth.
>>
>> Any suggestions / criticism / advice is highly welcome :-)
>>
>> Thank you!
>>
>>
>> --
>> Mit freundlichen Gruessen / Best regards
>>
>> Oliver Dzombic
>> IP-Interactive
>>
>> mailto:info@xxxxxxxxxxxxxxxxx
>>
>> Anschrift:
>>
>> IP Interactive UG ( haftungsbeschraenkt )
>> Zum Sonnenberg 1-3
>> 63571 Gelnhausen
>>
>> HRB 93402 beim Amtsgericht Hanau
>> Geschäftsführung: Oliver Dzombic
>>
>> Steuer Nr.: 35 236 3622 1
>> UST ID: DE274086107
>>
>>
>> On 07.04.2016 at 05:32, Christian Balzer wrote:
>>>
>>> Hello,
>>>
>>> On Wed, 6 Apr 2016 20:35:20 +0200 Oliver Dzombic wrote:
>>>
>>>> Hi,
>>>>
>>>> I have some IO issues, and after Christian's great article/hint
>>>> about caches I plan to add caches too.
>>>>
>>> Thanks, version 2 is still a work in progress, as I keep running into
>>> unknowns.
>>>
>>> IO issues in what sense, like too many write IOPS for the current HW
>>> to sustain?
>>> Also, what are you using Ceph for, RBD hosting VM images?
>>>
>>> It will help you a lot if you can identify and quantify the usage
>>> patterns (including a rough idea of how many hot objects you have)
>>> and where you run into limits.
>>>
>>>> So now comes the troublesome question:
>>>>
>>>> How dangerous is it to add cache tiers to an existing cluster with
>>>> around 30 OSDs and 40 TB of data on 3-6 ( currently being reduced )
>>>> nodes?
>>>>
>>> You're reducing nodes? Why?
>>> More nodes/OSDs equates to more IOPS in general.
>>>
>>> 40TB is a sizable amount of data; how many objects does your cluster
>>> hold?
>>> Also, is that raw data or after replication (size 3?)?
>>> In short, "ceph -s" output please.
>>> ^.^
>>>
>>>> I mean, will everything just explode and I just die, or what is the
>>>> roadmap for introducing this once you already have a running
>>>> cluster?
>>>>
>>> That's pretty much straightforward from the Ceph docs at:
>>> http://docs.ceph.com/docs/master/rados/operations/cache-tiering/
>>> (replace master with hammer if you're running that)
>>>
>>> Nothing happens until the "set-overlay" bit and you will want to
>>> configure all the pertinent bits before that.
>>>
>>> A basic question is whether you will have dedicated SSD cache tier
>>> hosts or have the SSDs holding the cache pool in your current hosts.
>>> Dedicated hosts have the advantage of matched HW (CPU power to match
>>> the SSDs) and simpler configuration; shared hosts can have the
>>> advantage of spreading the network load further out instead of having
>>> everything going through the cache tier nodes.
>>>
>>> The size and length of the explosion will entirely depend on:
>>> 1) how capable your current cluster is, how (over)loaded it is.
>>> 2) the actual load/usage at the time you phase the cache tier in.
>>> 3) the amount of "truly hot" objects you have.
>>>
>>> As I wrote here:
>>> http://lists.ceph.com/pipermail/ceph-users-ceph.com/2016-February/007933.html
>>>
>>> In my case, with a BADLY overloaded base pool and a constant stream
>>> of log/status writes (4-5 MB/s, 1000 IOPS) from 200 VMs, it all
>>> stabilized after 10 minutes.
>>>
>>> Truly hot objects as mentioned above will be those (in the case of VM
>>> images) holding the inodes of active directories and files.
>>>
>>>
>>>> Anything that needs to be considered? Dangerous no-no's?
>>>>
>>>> Also it will happen that I have to add the cache tier servers one by
>>>> one, and not all at the same time.
>>>>
>>> You want at least 2 cache tier servers from the start, and well-known,
>>> well-tested (LSI timeouts!) SSDs in them.
>>>
>>> Christian
>>>
>>>> I am happy for any kind of advice.
>>>>
>>>> Thank you!
>>>>
>>>
>>>
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com