Hi Chris,

thank you very much for your advice! Currently we already have the
following running:

osd_op_threads = 8
osd_max_backfills = 1
osd_recovery_max_active = 1

I will add your suggestions! For sure there is a lot of room for tweaking
the config, which is currently very basic. A rough sketch of what the
combined settings could look like, and of the cache tier setup I have in
mind, follows below.
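To make the changes stick across OSD restarts, my plan is to inject them at
runtime the way you showed and to also put them into ceph.conf. This is only
a rough, untested sketch combining our current values with the ones from
your mail (I left out osd-max-recovery-threads until I have double-checked
the exact option name on Hammer):

[osd]
    osd op threads           = 8
    osd max backfills        = 1
    osd recovery max active  = 1
    osd recovery op priority = 1
    osd client op priority   = 63
    osd snap trim sleep      = 0.1

As far as I understand, injectargs only changes the values of the running
daemons, while the ceph.conf entries take effect the next time an OSD is
(re)started, so doing both should cover it.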
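For the cache tier itself (see Christian's pointer to the docs further down
in the thread), I expect the sequence to look roughly like the following.
The pool names and the sizing values are just placeholders; none of this
has been tested on our cluster yet:

ceph osd tier add rbd rbd-cache
ceph osd tier cache-mode rbd-cache writeback
ceph osd pool set rbd-cache hit_set_type bloom
ceph osd pool set rbd-cache hit_set_count 1
ceph osd pool set rbd-cache hit_set_period 3600
ceph osd pool set rbd-cache target_max_bytes 500000000000
ceph osd pool set rbd-cache cache_target_dirty_ratio 0.4
ceph osd pool set rbd-cache cache_target_full_ratio 0.8
ceph osd tier set-overlay rbd rbd-cache

Following Christian's advice, the set-overlay step comes last, since nothing
changes for the clients until then, and that is the moment when the cache
pool actually starts taking traffic.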
--
Mit freundlichen Gruessen / Best regards

Oliver Dzombic
IP-Interactive

mailto:info@xxxxxxxxxxxxxxxxx

Anschrift:

IP Interactive UG ( haftungsbeschraenkt )
Zum Sonnenberg 1-3
63571 Gelnhausen

HRB 93402 beim Amtsgericht Hanau
Geschäftsführung: Oliver Dzombic

Steuer Nr.: 35 236 3622 1
UST ID: DE274086107

On 07.04.2016 at 20:32, Chris Taylor wrote:
> Hi Oliver,
>
> Have you tried tuning some of the cluster settings to fix the IO errors
> in the VMs?
>
> We found some of the same issues when reweighting, backfilling and
> removing large snapshots. By minimizing the number of concurrent
> backfills and prioritizing client IO we can now add/remove OSDs without
> the VMs throwing those nasty IO errors.
>
> We have been running a 3 node cluster for about a year now on Hammer
> with 45 2TB SATA OSDs and no SSDs. It's backing KVM hosts and RBD images.
>
> Here are the things we changed:
>
> ceph tell osd.* injectargs '--osd-max-backfills 1'
> ceph tell osd.* injectargs '--osd-max-recovery-threads 1'
> ceph tell osd.* injectargs '--osd-recovery-op-priority 1'
> ceph tell osd.* injectargs '--osd-client-op-priority 63'
> ceph tell osd.* injectargs '--osd-recovery-max-active 1'
> ceph tell osd.* injectargs '--osd-snap-trim-sleep 0.1'
>
> Recovery may take a little longer while backfilling, but the cluster is
> still responsive and we have happy VMs now.
>
> I've collected these from various posts on the ceph-users list.
>
> Maybe they will help you if you haven't tried them already.
>
> Chris
>
> On 2016-04-07 4:18 am, Oliver Dzombic wrote:
>
>> Hi Christian,
>>
>> thank you for answering, I appreciate your time!
>>
>> ---
>>
>> It's used for RBD-hosted VMs and also CephFS-hosted VMs.
>>
>> Well, the basic problem is/was that single OSDs simply go out/down,
>> ending in SATA bus errors for the VMs, which then have to be rebooted,
>> if they can be at all, because as long as OSDs are missing in that
>> scenario the customers can't start their VMs.
>>
>> Installing/checking Munin revealed very high drive utilization, and
>> with that simply an overloaded cluster.
>>
>> The initial setup was 4 nodes with 4x mon, each with 3x 6 TB HDD and
>> 1x SSD for journal.
>>
>> So I started to add more OSDs ( 2 nodes, each with 3x 6 TB HDD and 1x
>> SSD for journal ) and, as first aid, reduced the replication from 3 to
>> 2 to lower the (write) load of the cluster.
>>
>> I planned to wait until the new LTS is out, but I have already added
>> another node with 10x 3 TB HDD, 2x SSD for journal and 2-3x SSD for
>> cache tier ( changing strategy and increasing the number of drives
>> while reducing their size - the original sizing was a design mistake
>> on my part ).
>>
>>      osdmap e31602: 28 osds: 28 up, 28 in
>>             flags noscrub,nodeep-scrub
>>       pgmap v13849513: 1428 pgs, 5 pools, 19418 GB data, 4932 kobjects
>>             39270 GB used, 88290 GB / 124 TB avail
>>                 1428 active+clean
>>
>> The load ranges from 200 op/s to around 5000 op/s.
>>
>> The current average drive utilization is 20-30%.
>>
>> If we have backfill ( OSD out/down ) or reweight, the utilization of
>> the HDDs goes straight to 90-100%.
>>
>> Munin shows on all drives ( except the SSDs ) a disk latency of
>> 170 ms on average, with a minimum of 80-130 ms and a maximum of
>> 300-600 ms.
>>
>> Currently, the 4 initial nodes are in datacenter A and the 3 other
>> nodes are, together with most of the VMs, in datacenter B.
>>
>> I am currently draining the 4 initial nodes by using ceph osd reweight
>> to lower their usage little by little, so that the OSDs can be removed
>> from there completely and only the monitors stay up.
>>
>> The complete cluster has to move to one single datacenter together
>> with all VMs.
>>
>> ---
>>
>> I am reducing the number of nodes because, from an administrative
>> point of view, it is not very handy. I prefer extending the hardware
>> power in terms of CPU, RAM and HDD.
>>
>> So the final cluster will look like this:
>>
>> 3x OSD nodes, each:
>>
>> 2x E5-2620v3 CPU, 128 GB RAM, 2x 10 Gbit network, Adaptec HBA 1000-16e
>> to connect to external JBOD servers holding the cold-storage HDDs.
>> Maybe ~24 drives of 2 or 3 TB, SAS or SATA, 7200 RPM.
>>
>> I think SAS is, because of the reduced access times ( 4-5 ms vs.
>> 10 ms ), very useful in a Ceph environment. But then again, maybe with
>> a cache tier the impact/difference is not really that big.
>>
>> That together with Samsung SM863 240 GB SSDs for journal and cache
>> tier, connected to the board directly or to a separate Adaptec HBA
>> 1000-16i.
>>
>> So far the current idea/theory/plan.
>>
>> ---
>>
>> But it is a long road to that point. Last night I was reweighting
>> 3 OSDs from 1.0 to 0.9, which ended with one HDD going down/out, so I
>> had to restart the OSD ( with, again, IO errors in some of the VMs ).
>>
>> So based on your article, the cache tier solved your problem, and I
>> think I have basically the same one.
>>
>> ---
>>
>> So a very good hint is to activate the whole cache tier at night, when
>> things are a bit more smooth.
>>
>> Any suggestions / criticism / advice is highly welcome :-)
>>
>> Thank you!
>>
>>
>> --
>> Mit freundlichen Gruessen / Best regards
>>
>> Oliver Dzombic
>> IP-Interactive
>>
>> mailto:info@xxxxxxxxxxxxxxxxx
>>
>> Anschrift:
>>
>> IP Interactive UG ( haftungsbeschraenkt )
>> Zum Sonnenberg 1-3
>> 63571 Gelnhausen
>>
>> HRB 93402 beim Amtsgericht Hanau
>> Geschäftsführung: Oliver Dzombic
>>
>> Steuer Nr.: 35 236 3622 1
>> UST ID: DE274086107
>>
>>
>> On 07.04.2016 at 05:32, Christian Balzer wrote:
>>>
>>> Hello,
>>>
>>> On Wed, 6 Apr 2016 20:35:20 +0200 Oliver Dzombic wrote:
>>>
>>>> Hi,
>>>>
>>>> I have some IO issues, and after Christian's great article/hint
>>>> about caches I plan to add caches too.
>>>>
>>> Thanks, version 2 is still a work in progress, as I keep running into
>>> unknowns.
>>>
>>> IO issues in what sense, like too many write IOPS for the current HW
>>> to sustain?
>>> Also, what are you using Ceph for, RBD hosting VM images?
>>>
>>> It will help you a lot if you can identify and quantify the usage
>>> patterns (including a rough idea of how many hot objects you have)
>>> and where you run into limits.
>>>
>>>> So now comes the troublesome question:
>>>>
>>>> How dangerous is it to add cache tiers to an existing cluster with
>>>> around 30 OSDs and 40 TB of data on 3-6 ( currently being reduced )
>>>> nodes?
>>>>
>>> You're reducing nodes? Why?
>>> More nodes/OSDs equates to more IOPS in general.
>>>
>>> 40TB is a sizable amount of data; how many objects does your cluster
>>> hold?
>>> Also, is that raw data or after replication (size 3?)?
>>> In short, "ceph -s" output please.
>>> ^.^
>>>
>>>> I mean, will everything just explode and I just die, or what is the
>>>> roadmap for introducing this once you already have a running
>>>> cluster?
>>>>
>>> That's pretty much straightforward from the Ceph docs at:
>>> http://docs.ceph.com/docs/master/rados/operations/cache-tiering/
>>> (replace master with hammer if you're running that)
>>>
>>> Nothing happens until the "set-overlay" bit and you will want to
>>> configure all the pertinent bits before that.
>>>
>>> A basic question is whether you will have dedicated SSD cache tier
>>> hosts or have the SSDs holding the cache pool in your current hosts.
>>> Dedicated hosts have the advantage of matched HW (CPU power to match
>>> the SSDs) and simpler configuration; shared hosts can have the
>>> advantage of spreading the network load further out instead of having
>>> everything going through the cache tier nodes.
>>>
>>> The size and length of the explosion will entirely depend on:
>>> 1) how capable your current cluster is, how (over)loaded it is.
>>> 2) the actual load/usage at the time you phase the cache tier in.
>>> 3) the amount of "truly hot" objects you have.
>>>
>>> As I wrote here:
>>> http://lists.ceph.com/pipermail/ceph-users-ceph.com/2016-February/007933.html
>>>
>>> In my case, with a BADLY overloaded base pool and a constant stream
>>> of log/status writes (4-5 MB/s, 1000 IOPS) from 200 VMs, it all
>>> stabilized after 10 minutes.
>>>
>>> Truly hot objects as mentioned above will be those (in the case of VM
>>> images) holding the inodes of active directories and files.
>>>
>>>
>>>> Anything that needs to be considered? Dangerous no-no's?
>>>>
>>>> Also it will happen that I have to add the cache tier servers one by
>>>> one, and not all at the same time.
>>>>
>>> You want at least 2 cache tier servers from the start, and well-known,
>>> well-tested (LSI timeouts!) SSDs in them.
>>>
>>> Christian
>>>
>>>> I am happy for any kind of advice.
>>>>
>>>> Thank you!
>>>>
>>>
>>>
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com