Hello Nick,

On Mon, 7 Mar 2016 08:30:52 -0000 Nick Fisk wrote:

> Hi Christian,
>
> > -----Original Message-----
> > From: ceph-users [mailto:ceph-users-bounces@xxxxxxxxxxxxxx] On Behalf
> > Of Christian Balzer
> > Sent: 07 March 2016 02:22
> > To: ceph-users <ceph-users@xxxxxxxxxxxxxx>
> > Subject: Re: Cache tier operation clarifications
> >
> > Hello,
> >
> > I'd like to get some insights and confirmations from people here who
> > are either familiar with the code or have tested this more empirically
> > than me (the VM/client node of my test cluster is currently pining for
> > the fjords).
> >
> > When it comes to flushing/evicting, we already established that this
> > triggers based on PG utilization, not a pool-wide one.
> > So for example in a pool with 1024GB capacity (set via
> > target_max_bytes), 1024 PGs and a cache_target_dirty_ratio of 0.5,
> > flushing will start when the first PG reaches 512MB utilization.
> >
> > However, while the documentation states that the least recently used
> > objects are evicted when things hit the cache_target_full_ratio, it is
> > less than clear (understatement of the year) where flushing is
> > concerned. To quote:
> > "When the cache pool consists of a certain percentage of modified (or
> > dirty) objects, the cache tiering agent will flush them to the storage
> > pool."
> >
> > How do we read this?
> > When hitting 50% (as in the example above), will all of the dirty
> > objects get flushed?
> > That doesn't match what I'm seeing, nor would it be a sensible course
> > of action to unleash such a potentially huge torrent of writes.
> >
> > If we interpret this as "get the dirty objects below the threshold"
> > (which is what seems to happen), there are 2 possible courses of
> > action here:
> >
> > 1. Flush dirty object(s) from the PG that has reached the threshold.
> > A sensible course of action in terms of reducing I/Os, but it may keep
> > flushing the same objects over and over again if they happen to be on
> > the "full" PG.
>
> I think this is how it works. The agents/hitsets work at the per-PG
> level and the flushing code is very closely linked. I can't be 100%
> sure, but I'm 90%+ sure.
>
> https://github.com/ceph/ceph/blob/master/src/osd/ReplicatedPG.cc#L11967
>
I'm quite illiterate in C++, but yeah, that seems to be happening here.
Which is not totally surprising.

> It uses that cache_min_flush_age variable to check if the object is old
> enough to be flushed, but I can't see any logic as to how it selects
> objects in the first place. It almost looks like it just cycles through
> all the objects in order; it would be nice to have this confirmed.
>
This would mesh with no such logic being mentioned in the meager
documentation, mind.

> In releases after Hammer there are two thresholds that flush at
> different speeds. This can help as:
>
> 1. It means that at the low threshold it uses less IO to flush.
> 2. Between the low and high thresholds the cache effectively cleans
> itself down to the low threshold during idle periods, so it's ready to
> absorb bursts of writes when your workloads get busy.
>
Yup, I know, and I'm aware of the role you played in getting that
implemented.

> You need to play around with the max_agent_ops variable for both, which
> controls how many concurrent flushes can occur, so that during normal
> behaviour the % dirty is somewhere between the low and high thresholds.
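For the archives, if I have the names right this boils down to the knobs
below; "cachepool" is just a placeholder and the option names are what I
believe they are called in post-Hammer releases, so treat this as a
sketch rather than gospel:

  # pool-level thresholds (fractions of target_max_bytes/target_max_objects)
  ceph osd pool set cachepool cache_target_dirty_ratio 0.4
  ceph osd pool set cachepool cache_target_dirty_high_ratio 0.6
  ceph osd pool set cachepool cache_target_full_ratio 0.8

  # and on the OSD side (ceph.conf or injectargs): how many concurrent
  # flush operations the tiering agent may run below/above the high
  # threshold
  [osd]
  osd_agent_max_low_ops = 2
  osd_agent_max_ops = 4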
>
This is indeed quite nice and a very good method to keep things in
balance for a cache/cluster that is pretty busy and getting
filled/dirtied at a high pace.
Our cache pool usage grows by less than 10% a week though, so for us a
constant trickle back to the HDD OSD base pool is less desirable than
flushing larger chunks in off-peak hours and having the base pool idle
during peak hours, ready to serve read requests at full speed.
This would also tie in well with read-forward cache mode.
That's why I'm pondering a ratio fondling cronjob on certain nights (a
rough sketch of what I have in mind is in the PS at the bottom of this
mail).

> Although at the moment none of this is accessible to you, until you
> upgrade to Jewel in the future.
>
> > 2. Flush dirty objects from all PGs (most likely in a least recently
> > used fashion) and stop when we eventually get under the threshold by
> > having finally hit the "full" PG.
> > Results in a lot more IO, but will of course create more clean
> > objects available for eviction if needed.
>
> This is what I think is happening.
>
> > So, is there any "least recently used" consideration in effect here,
> > or is the only way to avoid (pointless) flushes by setting
> > "cache_min_flush_age" accordingly?
> >
> > Unlike for flushes above, eviction clearly states that it's going by
> > "least recently used".
> > Which, in the case of per-PG operation, would violate that promise,
> > as people of course expect this to be pool wide.

Let me beat that concussed horse a bit more: I can easily see a scenario
where a lot of busy objects wind up on the same PG and thus get
flushed/evicted, while other PGs have far more stale data and never get
touched because they're small enough.
Unfortunate, inelegant and certainly not expected.

> > And if it is indeed pool wide, the same effect as above will happen:
> > evictions will happen until the "full" PG gets hit, evicting far more
> > than would have been needed.
> >
> > Something to maybe consider would be a target value, for example with
> > "cache_target_full_ratio" at 0.80 and "cache_target_full_ratio_target"
> > at 0.78, evicting things until it reaches the target ratio.
>
> How is that any different from target_max_bytes (which is effectively
> 1.0) and cache_target_full_ratio = 0.8?
>
Firstly, my understanding is that reaching target_max_bytes brings
things to a screeching halt; that is certainly what I managed to do a
few times on my test cluster.
Secondly, same reason as above: empty out more than the immediate need
so you don't have to deal with evictions for a while.
Admittedly that is a lot less important (stressful for the cluster) with
evictions than with flushes.

> > Lastly, while we have perf counters like "tier_dirty", a gauge for
> > dirty and clean objects/bytes would be quite useful, to me at least.
> >
> I agree it would be nice to have these as counters; I had to write a
> diamond collector to scrape these figures out of "ceph df detail".
>
Thanks for mentioning that, I hadn't used the detail bit for ages.
Turns out my ratio is about 50/50.

Regards,

Christian

> > And clearly the cache tier agent already has those numbers.
> > Right now I'm guesstimating that most of my cache objects are
> > actually clean (from VM reboots, only read, never written to), but I
> > have no way to tell for sure.
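For the record, the 50/50 figure above came out of a check along the
lines of the snippet below; the JSON field names are from memory, so
double-check them against your version before trusting the numbers:

  # dirty vs. total objects per pool, from the JSON output
  ceph df detail -f json | \
    jq -r '.pools[] | [.name, .stats.dirty, .stats.objects] | @tsv'

PS: As for the ratio fondling cronjob mentioned further up, it would be
nothing fancier than dropping the dirty ratio during the quiet hours and
putting it back in the morning. Pool name and numbers are of course
placeholders:

  # /etc/cron.d/cache-flush (sketch, untested)
  # flush down aggressively at 02:00, back to normal at 06:00
  0 2 * * *  root  ceph osd pool set cachepool cache_target_dirty_ratio 0.2
  0 6 * * *  root  ceph osd pool set cachepool cache_target_dirty_ratio 0.5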
> >
> > Regards,
> >
> > Christian
> > --
> > Christian Balzer        Network/Systems Engineer
> > chibi@xxxxxxx           Global OnLine Japan/Rakuten Communications
> > http://www.gol.com/

--
Christian Balzer        Network/Systems Engineer
chibi@xxxxxxx           Global OnLine Japan/Rakuten Communications
http://www.gol.com/

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com