Re: Cache tier operation clarifications

Hello Nick,

On Mon, 7 Mar 2016 08:30:52 -0000 Nick Fisk wrote:

> Hi Christian,
> 
> > -----Original Message-----
> > From: ceph-users [mailto:ceph-users-bounces@xxxxxxxxxxxxxx] On Behalf
> > Of Christian Balzer
> > Sent: 07 March 2016 02:22
> > To: ceph-users <ceph-users@xxxxxxxxxxxxxx>
> > Subject: Re:  Cache tier operation clarifications
> > 
> > 
> > Hello,
> > 
> > I'd like to get some insights and confirmations from people here who
> > are either familiar with the code or have tested this more empirically
> > than me (the VM/client node of my test cluster is currently pining for
> > the fjords).
> > 
> > When it comes to flushing/evicting, we already established that this
> > triggers based on PG utilization, not a pool-wide one.
> > So for example, in a pool with 1024GB capacity (set via
> > target_max_bytes) and 1024 PGs and a cache_target_dirty_ratio of 0.5,
> > flushing will start when the first PG reaches 512MB utilization.
> > 
> > However, while the documentation states that the least recently used
> > objects are evicted when things hit the cache_target_full_ratio, it is
> > less than clear (understatement of the year) where flushing is
> > concerned. To quote:
> > "When the cache pool consists of a certain percentage of modified (or
> > dirty) objects, the cache tiering agent will flush them to the storage
> > pool."
> > 
> > How do we read this?
> > When hitting 50% (as in the example above), will all of the dirty
> > objects get flushed?
> > That doesn't match what I'm seeing, nor would it be a sensible course
> > of action to unleash such a potentially huge torrent of writes.
> > 
> > If we interpret this as "get the dirty objects below the threshold"
> > (which is what seems to happen), there are 2 possible courses of
> > action here:
> > 
> > 1. Flush dirty object(s) from the PG that has reached the threshold.
> > A sensible course of action in terms of reducing I/Os, but it may keep
> > flushing the same objects over and over again if they happen to be on
> > the "full" PG.
> 
> I think this is how it works. The agents/hitsets work at the per PG level
> and the flushing code is very closely linked. I can't be 100% sure, but
> I'm 90%+ sure.
> 
> https://github.com/ceph/ceph/blob/master/src/osd/ReplicatedPG.cc#L11967
> 
I'm quite illiterate in C++, but yeah, that seems to be happening here.

Which is not totally surprising.
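
For my own notes, the per-PG arithmetic from my example above boils down
to something like this (a back-of-the-envelope sketch, not taken from the
actual agent code; the function and variable names are made up):

def per_pg_flush_threshold(target_max_bytes, pg_num, dirty_ratio):
    """Approximate dirty bytes per PG at which flushing should kick in,
    assuming target_max_bytes is spread evenly across the PGs."""
    return target_max_bytes / pg_num * dirty_ratio

# The example from above: 1024GB cache, 1024 PGs, dirty ratio 0.5
threshold = per_pg_flush_threshold(1024 * 2**30, 1024, 0.5)
print("%.0f MB per PG" % (threshold / 2**20))   # -> 512 MB per PG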

> 
> It uses that cache_min_flush_age variable to check if the object is old
> enough to be flushed, but I can't see any logic as to how it selects
> objects in the first place. It almost looks like it just cycles through
> all the objects in order; it would be nice to have this confirmed.
> 
This would mesh with the fact that no such logic is mentioned in the
meager documentation, mind.
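
If that reading is right, the selection logic would amount to something
like the sketch below. This is purely my interpretation expressed as
Python pseudocode, not the actual agent code; CachedObject, flush_to_base
and friends are all made-up stand-ins.

import time

class CachedObject(object):
    """Stand-in for a cache-tier object's metadata."""
    def __init__(self, name, size, mtime, dirty):
        self.name = name
        self.size = size
        self.mtime = mtime
        self.dirty = dirty

def flush_to_base(obj):
    # Stand-in for the actual copy-down to the base pool.
    print("flushing %s" % obj.name)

def agent_flush_pass(pg_objects, cache_min_flush_age, dirty_target_bytes):
    """Walk the PG's objects in whatever order they come, flush dirty ones
    that are old enough, and stop once the PG is back under its dirty
    target. Note: no LRU ordering anywhere, which matches what the code
    appears to do."""
    dirty_bytes = sum(o.size for o in pg_objects if o.dirty)
    now = time.time()
    for obj in pg_objects:
        if dirty_bytes <= dirty_target_bytes:
            break
        if obj.dirty and now - obj.mtime >= cache_min_flush_age:
            flush_to_base(obj)
            obj.dirty = False
            dirty_bytes -= obj.size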
 
> In releases after Hammer there are two thresholds that flush at different
> speeds. This can help as:
> 
> 1. It means that at the low threshold it uses less IO to flush
> 2. Between low and high thresholds the cache effectively cleans itself
> down to the low threshold during idle periods. So it's ready to absorb
> bursts of writes when your workloads get busy.
> 
Yup, I know, and I also know the role you played in getting that implemented.

> You need to play around with the max_agent_ops variable for both, which
> controls how many concurrent flushes can occur, so that during normal
> behaviour the % dirty is somewhere between the low and high thresholds.
>

This is indeed quite nice and a very good method to keep things in balance
for a cache/cluster that is pretty busy and getting filled/dirtied at a
high pace.

Our cache pool usage grows by less than 10% a week though, so for us
a constant trickle back to the HDD OSD base pool is less desirable than
flushing larger chunks in off-peak hours and having the base pool idle
during peak hours, ready to serve read requests at full speed.
This would also tie in well with read-forward cache mode.
 
That's why I'm pondering a ratio fondling cronjob on certain nights.
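
Concretely, something along these lines, invoked from cron in the evening
and again in the morning. Just a sketch: the pool name and the ratios are
made up, and all it does is nudge cache_target_dirty_ratio up and down so
the bulk of the flushing happens off-peak.

#!/usr/bin/env python
# Lower the dirty ratio at night (flush more aggressively off-peak),
# raise it again in the morning so the base pool stays quiet during the day.
import subprocess
import sys

POOL = "cache"                   # hypothetical cache pool name

def set_dirty_ratio(ratio):
    subprocess.check_call(
        ["ceph", "osd", "pool", "set", POOL,
         "cache_target_dirty_ratio", str(ratio)])

if __name__ == "__main__":
    # e.g. "night" from a 01:00 crontab entry, "day" from a 07:00 one
    mode = sys.argv[1] if len(sys.argv) > 1 else "day"
    set_dirty_ratio(0.4 if mode == "night" else 0.6)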

> Although at the moment none of this is accessible to you, until you
> upgrade to Jewel.
> 
> > 
> > 2. Flush dirty objects from all PGs (most likely in a least recently
> > used fashion) and stop when we're eventually under the threshold by
> > having finally hit the "full" PG.
> > Results in a lot more IO, but will of course create more clean objects
> > available for eviction if needed.
> > This is what I think is happening.
> > 
> > So, is there any "least recently used" consideration in effect here,
> > or is the only way to avoid (pointless) flushes to set
> > "cache_min_flush_age" accordingly?
> > 
> > Unlike for flushes above, the documentation clearly states that
> > eviction goes by "least recently used".
> > Which, in the case of per-PG operation, would violate that promise, as
> > people of course expect this to be pool-wide.

Let me beat that concussed horse a bit more: I can easily see a scenario
where a lot of busy objects wind up on the same PG and thus get
flushed/evicted, while other PGs have far more stale data and never get
touched because they're small enough.

Unfortunate, inelegant and certainly not expected.
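
To make that concrete with some toy numbers (all invented, 8 PGs of 1GB
each as an illustration): per-PG accounting can trigger flushing while the
pool as a whole is nowhere near the dirty ratio, and always on the same
busy PG.

# Toy illustration of the per-PG hotspot concern; all figures invented.
per_pg_dirty_mb = [620, 110, 95, 130, 80, 105, 90, 120]   # 8 PGs, 1GB each
threshold_mb = 512                                         # 0.5 * 1GB per PG
pool_dirty_pct = 100.0 * sum(per_pg_dirty_mb) / (len(per_pg_dirty_mb) * 1024)
print("pool-wide dirty: %.0f%%" % pool_dirty_pct)          # ~16%
print("PGs over threshold: %s"
      % [i for i, d in enumerate(per_pg_dirty_mb) if d > threshold_mb])  # [0]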

> > And if it is indeed pool wide, the same effect as above will happen,
> evictions
> > will happen until the "full" PG gets hit, evicting far more than would
> have
> > been needed.
> > 
> > 
> > Something to maybe consider would be a target value, for example with
> > "cache_target_full_ratio" at 0.80 and "cache_target_full_ratio_target"
> > at 0.78, evicting things until it reaches the target ratio.
> 
> How is that any different from target_max_bytes (which is effectively
> 1.0) and cache_target_full_ratio = 0.8?
> 
Firstly, my understanding is that reaching target_max_bytes brings things
to a screeching halt; that is certainly what I managed to do a few times
on my test cluster.

Secondly, the same reason as above: empty out more than the immediate
need, so you don't have to deal with evictions for a while.

Admittedly that is a lot less important (stressful for the cluster) with
evictions than with flushes.


> 
> > 
> > Lastly, while we have perf counters like "tier_dirty", a gauge for
> > dirty and clean objects/bytes would be quite useful, to me at least.
> 
> 
> I agree it would be nice to have these as counters; I had to write a
> diamond collector to scrape these figures out of "ceph df detail".
> 

Thanks for mentioning that; I hadn't used the detail bit for ages.
Turns out my ratio is about 50/50.
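
For anyone else wanting to graph this, a quick-and-dirty scrape along the
lines of what Nick's diamond collector presumably does. The key names
("pools", "stats", "dirty") match what I see in my output, but they may
well differ between releases, so check "ceph df detail --format json" on
your own cluster first.

#!/usr/bin/env python
# Print per-pool object and dirty-object counts from "ceph df detail".
import json
import subprocess

out = subprocess.check_output(["ceph", "df", "detail", "--format", "json"])
for pool in json.loads(out)["pools"]:
    stats = pool["stats"]
    print("%s objects=%s dirty=%s" % (pool["name"],
                                      stats.get("objects"),
                                      stats.get("dirty")))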

Regards,

Christian

> 
> > And clearly the cache tier agent already has those numbers.
> > Right now I'm guesstimating that most of my cache objects are actually
> > clean (from VM reboots, only read, never written to), but I have no
> > way to tell for sure.
> > 
> > Regards,
> > 
> > Christian
> > --
> > Christian Balzer        Network/Systems Engineer
> > chibi@xxxxxxx   	Global OnLine Japan/Rakuten Communications
> > http://www.gol.com/
> > _______________________________________________
> > ceph-users mailing list
> > ceph-users@xxxxxxxxxxxxxx
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 
> 


-- 
Christian Balzer        Network/Systems Engineer                
chibi@xxxxxxx   	Global OnLine Japan/Rakuten Communications
http://www.gol.com/
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


