Hi Christian,

thank you for answering, I appreciate your time!

---

It is used for RBD-hosted VMs and also CephFS-hosted VMs.

Well, the basic problem is/was that single OSDs simply go out/down, ending in SATA bus errors for the VMs, which then have to be rebooted, if they can at all, because as long as OSDs are missing in that scenario the customers can't start their VMs.

Installing/checking Munin revealed a very high drive utilization, and with that simply an overload of the cluster.

The initial setup was 4 nodes (4x mon), each with 3x 6 TB HDD and 1x SSD for journal. So I started to add more OSDs (2 nodes, each with 3x 6 TB HDD and 1x SSD for journal) and, as first aid, reduced the replication from 3 to 2 to lower the (write) load of the cluster.

I planned to wait until the new LTS is out, but I have now already added another node with 10x 3 TB HDD, 2x SSD for journal and 2-3x SSD for the cache tier (changing strategy and increasing the number of drives while reducing their size; the old layout was a design mistake on my part).

The current "ceph -s" shows:

     osdmap e31602: 28 osds: 28 up, 28 in
            flags noscrub,nodeep-scrub
      pgmap v13849513: 1428 pgs, 5 pools, 19418 GB data, 4932 kobjects
            39270 GB used, 88290 GB / 124 TB avail
                1428 active+clean

The op/s range goes from 200 op/s to around 5000 op/s. The current average drive utilization is 20-30%. If we have backfill (OSD out/down) or a reweight, the utilization of the HDDs goes straight to 90-100%. Munin shows on all drives (except the SSDs) an average disk latency of 170 ms, with a minimum of 80-130 ms and a maximum of 300-600 ms.

Currently the 4 initial nodes are in datacenter A and the 3 other nodes are, together with most of the VMs, in datacenter B. I am currently draining the 4 initial nodes by doing "ceph osd reweight" to reduce their usage little by little, in order to remove the OSDs there completely and keep only the monitors running. The complete cluster has to move to one single datacenter together with all VMs.

---

I am reducing the number of nodes because, from an administrative point of view, many small nodes are not very handy. I prefer extending the hardware power in terms of CPU, RAM and HDD. So the final cluster will look like this:

3x OSD nodes, each with: 2x E5-2620v3 CPU, 128 GB RAM, 2x 10 Gbit network, and an Adaptec HBA 1000-16e to connect to external JBOD servers holding the cold-storage HDDs, maybe ~24 drives of 2 or 3 TB SAS or SATA at 7200 RPM.

I think SAS is, because of the reduced access times (4-5 ms vs. 10 ms), very useful in a Ceph environment. But then again, maybe with a cache tier the impact/difference is not really that big.

That, together with Samsung SM863 240 GB SSDs for journal and cache tier, connected directly to the board or to a separate Adaptec HBA 1000-16i.

So far the current idea/theory/plan.

---

But it's a long road to that point. Last night I did a reweight of 3 OSDs from 1.0 to 0.9, which ended with one HDD going down/out, so I had to restart the OSD (with, again, I/O errors in some of the VMs).

So, based on your article, the cache tier solved your problem, and I think I basically have the same one.

---

So a very good hint is to activate the whole cache tier at night, when things are a bit smoother.

Any suggestions / criticism / advice is highly welcome :-)

Thank you!
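
PS: For completeness, these are roughly the first-aid commands I am using to keep the load down while draining the old nodes (the pool name "rbd", the OSD id and the values are only examples from my side, not a recommendation):

  # lower the replication of a data pool from 3 to 2
  ceph osd pool set rbd size 2

  # lower an OSD's weight in small steps so only a little data moves at a time
  ceph osd reweight 12 0.9

  # throttle backfill/recovery so client IO on the HDDs does not starve
  ceph tell osd.* injectargs '--osd-max-backfills 1 --osd-recovery-max-active 1 --osd-recovery-op-priority 1'

Scrubbing is already switched off (noscrub / nodeep-scrub), as you can see in the "ceph -s" above.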
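
And this is the order in which I understood, from your mail and the docs, that the cache tier should be phased in at night ("ssd-cache" is just a placeholder name for the cache pool, "rbd" for the base pool, and the sizes/ratios are only guesses); please correct me if I got a step wrong:

  # create the tier relation and set the cache mode first
  ceph osd tier add rbd ssd-cache
  ceph osd tier cache-mode ssd-cache writeback

  # configure the pertinent bits before the overlay
  ceph osd pool set ssd-cache hit_set_type bloom
  ceph osd pool set ssd-cache hit_set_count 1
  ceph osd pool set ssd-cache hit_set_period 3600
  ceph osd pool set ssd-cache target_max_bytes 500000000000   # ~500 GB, just an example
  ceph osd pool set ssd-cache cache_target_dirty_ratio 0.4
  ceph osd pool set ssd-cache cache_target_full_ratio 0.8

  # nothing happens until this step, so this is the one to do at night
  ceph osd tier set-overlay rbd ssd-cache

If I got that right, everything before the last command should be harmless on the running cluster.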
--
Mit freundlichen Gruessen / Best regards

Oliver Dzombic
IP-Interactive

mailto:info@xxxxxxxxxxxxxxxxx

Address:

IP Interactive UG ( haftungsbeschraenkt )
Zum Sonnenberg 1-3
63571 Gelnhausen

HRB 93402 at the district court of Hanau
Managing director: Oliver Dzombic

Tax no.: 35 236 3622 1
VAT ID: DE274086107


On 07.04.2016 at 05:32, Christian Balzer wrote:
>
> Hello,
>
> On Wed, 6 Apr 2016 20:35:20 +0200 Oliver Dzombic wrote:
>
>> Hi,
>>
>> I have some IO issues, and after Christian's great article/hint about
>> caches I plan to add caches too.
>>
> Thanks, version 2 is still a work in progress, as I keep running into
> unknowns.
>
> IO issues in what sense, like too many write IOPS for the current HW to
> sustain?
> Also, what are you using Ceph for, RBD hosting VM images?
>
> It will help you a lot if you can identify and quantify the usage patterns
> (including a rough idea of how many hot objects you have) and where you
> run into limits.
>
>> So now comes the troublesome question:
>>
>> How dangerous is it to add cache tiers to an existing cluster with
>> around 30 OSDs and 40 TB of data on 3-6 (currently being reduced) nodes?
>>
> You're reducing nodes? Why?
> More nodes/OSDs equates to more IOPS in general.
>
> 40 TB is a sizable amount of data; how many objects does your cluster hold?
> Also, is that raw data or after replication (size 3?)?
> In short, "ceph -s" output please. ^.^
>
>> I mean, will everything just explode and I just die, or what is the
>> road map to introduce this after you already have a running cluster?
>>
> That's pretty much straightforward from the Ceph docs at:
> http://docs.ceph.com/docs/master/rados/operations/cache-tiering/
> (replace master with hammer if you're running that)
>
> Nothing happens until the "set-overlay" bit and you will want to configure
> all the pertinent bits before that.
>
> A basic question is whether you will have dedicated SSD cache tier hosts or
> have the SSDs holding the cache pool in your current hosts.
> Dedicated hosts have the advantage of HW (CPU power) matched to the SSDs
> and simpler configuration; shared hosts can have the advantage of spreading
> the network load further out instead of having everything go through the
> cache tier nodes.
>
> The size and length of the explosion will entirely depend on:
> 1) how capable your current cluster is, how (over)loaded it is
> 2) the actual load/usage at the time you phase the cache tier in
> 3) the amount of "truly hot" objects you have
>
> As I wrote here:
> http://lists.ceph.com/pipermail/ceph-users-ceph.com/2016-February/007933.html
>
> In my case, with a BADLY overloaded base pool and a constant stream of
> log/status writes (4-5 MB/s, 1000 IOPS) from 200 VMs, it all stabilized
> after 10 minutes.
>
> Truly hot objects, as mentioned above, will be those (in the case of VM
> images) holding active directory inodes and files.
>
>> Anything that needs to be considered? Dangerous no-nos?
>>
>> Also it will happen that I have to add the cache tier servers server by
>> server, and not all at the same time.
>>
> You want at least 2 cache tier servers from the start, and well known,
> well tested (LSI timeouts!) SSDs in them.
>
> Christian
>
>> I am happy for any kind of advice.
>>
>> Thank you!
>>