Re: adding cache tier in productive hammer environment

Hi Oliver,

 

 

Have you tried tuning some of the cluster settings to fix the IO errors in the VMs?

We found some of the same issues when reweighting, backfilling and removing large snapshots. By minimizing the number of concurrent backfills and prioritizing client IO we can now add/remove OSDs without the VMs throwing those nasty IO errors.

We have been running a 3 node cluster for about a year now on Hammer with 45 2TB SATA OSDs and no SSDs. It's backing KVM hosts and RBD images.

Here are the things we changed:

ceph tell osd.* injectargs '--osd-max-backfills 1'
ceph tell osd.* injectargs '--osd-recovery-threads 1'
ceph tell osd.* injectargs '--osd-recovery-op-priority 1'
ceph tell osd.* injectargs '--osd-client-op-priority 63'
ceph tell osd.* injectargs '--osd-recovery-max-active 1'
ceph tell osd.* injectargs '--osd-snap-trim-sleep 0.1'

Recovery may take a little longer while backfilling, but the cluster is still responsive and we have happy VMs now.
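
If you want these to survive OSD restarts, the same values can also go
into ceph.conf; roughly like this (just a sketch, adjust to your own layout):

[osd]
osd_max_backfills = 1
osd_recovery_threads = 1
osd_recovery_op_priority = 1
osd_client_op_priority = 63
osd_recovery_max_active = 1
osd_snap_trim_sleep = 0.1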

I've collected these from various posts from the ceph-users list.

Maybe they will help you if you haven't tried them already.

 

Chris

 

On 2016-04-07 4:18 am, Oliver Dzombic wrote:

Hi Christian,

thank you for answering, I appreciate your time!

---

It's used for RBD-hosted VMs and also CephFS-hosted VMs.

The basic problem is/was that single OSDs simply go out/down, ending in
SATA bus errors inside the VMs, which then have to be rebooted, if they
can be at all, because as long as OSDs are missing in that scenario the
customers cannot start their VMs.

Installing/checking Munin revealed very high drive utilization, in other
words simply an overload of the cluster.

The initial setup was 4 nodes with 4x mon, each node with 3x 6 TB HDD and
1x SSD for journal.

So I started to add more OSDs (2 nodes, each with 3x 6 TB HDD and 1x SSD
for journal) and, as first aid, reduced the replication from 3 to 2 to
lower the (write) load on the cluster.

I planned to wait until the new LTS is out, but I have now already added
another node with 10x 3 TB HDD, 2x SSD for journal and 2-3x SSD for the
cache tier (changing strategy and increasing the number of drives while
reducing their size; that was a design mistake on my part).

 osdmap e31602: 28 osds: 28 up, 28 in
            flags noscrub,nodeep-scrub
      pgmap v13849513: 1428 pgs, 5 pools, 19418 GB data, 4932 kobjects
            39270 GB used, 88290 GB / 124 TB avail
                1428 active+clean



The range goes from 200 op/s to around 5000 op/s.

The current average drive utilization is 20-30%.

If we have backfill (OSD out/down) or a reweight, the utilization of the
HDDs goes straight to 90-100%.

Munin shows on all drives (except the SSDs) an average disk latency of
170 ms, with a minimum of 80-130 ms and a maximum of 300-600 ms.

Currently the 4 initial nodes are in datacenter A, and the 3 other nodes
are in datacenter B together with most of the VMs.

I am currently draining the 4 initial nodes by using ceph osd reweight to
reduce their usage little by little, with the goal of removing those OSDs
completely and keeping only the monitors there.
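
Roughly like this, one OSD at a time and waiting for HEALTH_OK in between
(osd.12 is only an example):

ceph osd reweight 12 0.9        # then 0.8, 0.7, ... step by step
ceph osd out 12                 # once backfill from the last step is done
# stop the OSD daemon on its host, then remove it completely:
ceph osd crush remove osd.12
ceph auth del osd.12
ceph osd rm 12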

The complete cluster has to move into a single datacenter together with
all the VMs.

---

I am reducing the number of nodes because, from an administrative point of
view, many small nodes are not very handy. I prefer extending the hardware
power per node in terms of CPU, RAM and HDDs.

So the final cluster will look like this:

3x OSD Nodes, each:

2x E5-2620v3 CPU, 128 GB RAM, 2x 10 Gbit network, Adaptec HBA 1000-16e
to connect to external JBOD servers holding the cold storage HDDs.
Maybe ~24 drives of 2 or 3 TB, SAS or SATA at 7200 RPM.

I think SAS is very useful in a Ceph environment because of the reduced
access times (4-5 ms vs. 10 ms). But then again, maybe with a cache tier
the difference is not really that big.

That, together with Samsung SM863 240 GB SSDs for journal and cache tier,
connected to the board directly or to a separate Adaptec HBA 1000-16i.

So much for the current idea/theory/plan.

---

But it is a long road to that point. Last night I did a reweight of
3 OSDs from 1.0 to 0.9, which ended with one HDD going down/out, so I
had to restart the OSD (again with IO errors in some of the VMs).

So based on your article, the cache tier solved your problem, and I
think I have basically the same one.

---

So a very good hint is to activate the whole cache tier at night,
when things are a bit smoother.

Any suggestions / criticism / advice is highly welcome :-)

Thank you!


-- 
Mit freundlichen Gruessen / Best regards

Oliver Dzombic
IP-Interactive

mailto:info@xxxxxxxxxxxxxxxxx

Anschrift:

IP Interactive UG ( haftungsbeschraenkt )
Zum Sonnenberg 1-3
63571 Gelnhausen

HRB 93402 beim Amtsgericht Hanau
Geschäftsführung: Oliver Dzombic

Steuer Nr.: 35 236 3622 1
UST ID: DE274086107


On 07.04.2016 at 05:32, Christian Balzer wrote:

Hello,

On Wed, 6 Apr 2016 20:35:20 +0200 Oliver Dzombic wrote:

Hi,

I have some IO issues, and after Christian's great article/hint about
caches I plan to add caches too.

Thanks, version 2 is still a work in progress, as I keep running into
unknowns.

IO issues in what sense, like in too many write IOPS for the current HW to
sustain?
Also, what are you using Ceph for, RBD hosting VM images?

It will help you a lot if you can identify and quantify the usage patterns
(including a rough idea on how many hot objects you have) and where you
run into limits.
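
A few quick ways to get rough numbers for that (nothing fancy, just the
standard tools):

ceph -s                  # overall client op/s and throughput right now
ceph osd pool stats      # the same, broken down per pool
rados df                 # object counts and usage per pool
iostat -x 5              # per-disk utilization/await on the OSD nodes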

So now comes the troublesome question:

How dangerous is it to add cache tiers to an existing cluster with
around 30 OSDs and 40 TB of data on 3-6 (currently being reduced) nodes?

You're reducing nodes? Why?
More nodes/OSDs equates to more IOPS in general.

40 TB is a sizable amount of data; how many objects does your cluster hold?
Also is that raw data or after replication (size 3?)?
In short, "ceph -s" output please. ^.^

I mean, will everything just explode and I just die, or what is the
roadmap for introducing this once you already have a running cluster?

That's pretty much straightforward from the Ceph docs at:
http://docs.ceph.com/docs/master/rados/operations/cache-tiering/
(replace master with hammer if you're running that)

Nothing happens until the "set-overlay" bit and you will want to configure
all the pertinent bits before that.
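
Condensed, the sequence from those docs looks roughly like this; pool
names and numbers are only examples, size target_max_bytes to your
actual SSDs:

ceph osd pool create cache-pool 512 512
ceph osd tier add rbd cache-pool
ceph osd tier cache-mode cache-pool writeback
ceph osd pool set cache-pool hit_set_type bloom
ceph osd pool set cache-pool hit_set_count 1
ceph osd pool set cache-pool hit_set_period 3600
ceph osd pool set cache-pool target_max_bytes 200000000000
ceph osd pool set cache-pool cache_target_dirty_ratio 0.4
ceph osd pool set cache-pool cache_target_full_ratio 0.8
# nothing touches the client IO path until this last step:
ceph osd tier set-overlay rbd cache-pool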

A basic question is whether you will have dedicated SSD cache tier hosts
or put the SSDs holding the cache pool into your current hosts.
Dedicated hosts have the advantage of matched HW (enough CPU power for
the SSDs) and simpler configuration; shared hosts can have the advantage
of spreading the network load further out instead of having everything
go through the cache tier nodes.
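
If the SSDs go into the existing hosts, the cache pool also needs its own
CRUSH root and rule so it only lands on the SSD OSDs; roughly like this
(bucket and rule names are only examples):

ceph osd crush add-bucket ssd-root root
ceph osd crush add-bucket node1-ssd host
ceph osd crush move node1-ssd root=ssd-root
ceph osd crush set osd.24 1.0 host=node1-ssd    # repeat per SSD OSD
ceph osd crush rule create-simple ssd-rule ssd-root host
ceph osd pool set cache-pool crush_ruleset 1    # rule id from "ceph osd crush rule dump"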

The size and length of the explosion will entirely depend on:
1) how capable your current cluster is, how (over)loaded it is.
2) the actual load/usage at the time you phase the cache tier in
3) the amount of "truly hot" objects you have.

As I wrote here:
http://lists.ceph.com/pipermail/ceph-users-ceph.com/2016-February/007933.html

In my case, with a BADLY overloaded base pool and a constant stream of
log/status writes (4-5 MB/s, 1000 IOPS) from 200 VMs, it all stabilized
after 10 minutes.

Truly hot objects as mentioned above will be those (in the case of VM
images) holding active directory inodes and files.


Anything that needs to be considered? Dangerous no-nos?

Also, it will happen that I have to add the cache tier servers one by one,
not all at the same time.

You want at least 2 cache tier servers from the start, with well-known,
well-tested (LSI timeouts!) SSDs in them.

Christian

I am happy for any kind of advice.

Thank you !



_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
