Re: Cache tier operation clarifications

Hi Christian,

> -----Original Message-----
> From: ceph-users [mailto:ceph-users-bounces@xxxxxxxxxxxxxx] On Behalf Of
> Christian Balzer
> Sent: 07 March 2016 02:22
> To: ceph-users <ceph-users@xxxxxxxxxxxxxx>
> Subject: Re:  Cache tier operation clarifications
> 
> 
> Hello,
> 
> I'd like to get some insights and confirmations from people here who are
> either familiar with the code or have tested this more empirically than I
> have (the VM/client node of my test cluster is currently pining for the
> fjords).
> 
> When it comes to flushing/evicting we already established that this
> triggers based on PG utilization, not pool-wide utilization.
> So for example, in a pool with 1024GB capacity (set via target_max_bytes)
> and 1024 PGs and a cache_target_dirty_ratio of 0.5, flushing will start
> when the first PG reaches 512MB utilization.
> 
> However, while the documentation states that the least recently used
> objects are evicted when things hit the cache_target_full_ratio, it is
> less than clear (understatement of the year) where flushing is concerned.
> To quote:
> "When the cache pool consists of a certain percentage of modified (or
> dirty) objects, the cache tiering agent will flush them to the storage
> pool."
> 
> How do we read this?
> When hitting 50% (as in the example above), will all of the dirty objects
> get flushed?
> That doesn't match what I'm seeing, nor would it be a sensible course of
> action to unleash such a potentially huge torrent of writes.
> 
> If we interpret this as "get the dirty objects below the threshold" (which
> is what seems to happen) there are 2 possible courses of action here:
> 
> 1. Flush dirty object(s) from the PG that has reached the threshold.
> A sensible course of action in terms of reducing I/Os, but it may keep
> flushing the same objects over and over again if they happen to be on the
> "full" PG.

I think this is how it works. The agents/hitsets work at the per-PG level
and the flushing code is closely tied to them. I can't be 100% sure, but
I'm 90%+ sure.

https://github.com/ceph/ceph/blob/master/src/osd/ReplicatedPG.cc#L11967
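
So reading your example that way, the per-PG numbers come out roughly as:

  target_max_bytes 1024GB / 1024 PGs          = ~1GB per PG
  1GB per PG * cache_target_dirty_ratio 0.5   = ~512MB dirty per PG

i.e. the agent for a given PG starts flushing once that single PG crosses
roughly 512MB of dirty data, however empty the rest of the pool is.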


It uses the cache_min_flush_age variable to check if the object is old
enough to be flushed, but I can't see any logic as to how it selects
objects in the first place. It almost looks like it just cycles through all
the objects in order; it would be nice to have this confirmed.
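
For what it's worth, the age checks are per-pool settings, something like
this if I have the syntax right ("hotpool" and the values are only
placeholders):

ceph osd pool set hotpool cache_min_flush_age 600
ceph osd pool set hotpool cache_min_evict_age 1800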

In releases after Hammer there are two thresholds that flush at different
speeds. This can help because:

1. At the low threshold it uses less IO to flush.
2. Between the low and high thresholds the cache effectively cleans itself
down to the low threshold during idle periods, so it's ready to absorb
bursts of writes when your workloads get busy.
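
If memory serves, from Infernalis onwards the two thresholds are the
per-pool options cache_target_dirty_ratio and cache_target_dirty_high_ratio,
set along these lines ("hotpool" again being a placeholder):

ceph osd pool set hotpool cache_target_dirty_ratio 0.4
ceph osd pool set hotpool cache_target_dirty_high_ratio 0.6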

You need to play around with the max_agent_ops variable for both
thresholds, which controls how many concurrent flushes can occur, so that
during normal behaviour the % dirty sits somewhere between the low and high
thresholds.

Although at the moment none of this is accessible to you; it will have to
wait until you upgrade to Jewel.
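
For when you do get there, I believe the knobs that control how many
concurrent flushes the agent issues are OSD options along the lines of the
following (names and values from memory, so please double-check them):

[osd]
# concurrent flush ops once past the high threshold (name from memory)
osd agent max ops = 4
# concurrent flush ops between the low and high thresholds (name from memory)
osd agent max low ops = 2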

> 
> 2. Flush dirty objects from all PGs (most likely in a least recently used
> fashion) and stop when we're eventually under the threshold by having
> finally hit the "full" PG.
> Results in a lot more IO but will of course create more clean objects
> available for eviction if needed.
> This is what I think is happening.
> 
> So, is there any "least recently used" consideration in effect here, or is
> the only way to avoid (pointless) flushes to set "cache_min_flush_age"
> accordingly?
> 
> Unlike for flushes above, the documentation for eviction clearly states
> that it goes by "least recently used".
> In the case of per-PG operation this would violate that promise, as people
> of course expect it to be pool-wide.
> And if it is indeed pool-wide, the same effect as above will occur:
> evictions will happen until the "full" PG gets hit, evicting far more than
> would have been needed.
> 
> 
> Something to maybe consider would be a target value, for example with
> "cache_target_full_ratio" at 0.80 and "cache_target_full_ratio_target" at
> 0.78, evicting things until it reaches the target ratio.

How is that any different from target_max_bytes (which is effectively 1.0)
and cache_target_full_ratio = 0.8?
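
To put numbers on it:

  target_max_bytes 1TB * 1.0                          = hard cap on the cache
  target_max_bytes 1TB * cache_target_full_ratio 0.8  = eviction starts around ~800GB

so you already have a cap plus a lower level that eviction works towards,
which is roughly what the extra _target option would give you.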


> 
> Lastly, while we have perf counters like "tier_dirty", a gauge for dirty
> and clean objects/bytes would be quite useful to me at least.


I agree it would be nice to have these as counters; I had to write a
Diamond collector to scrape these figures out of "ceph df detail".
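
The gist of it is something like the sketch below (not my actual collector,
just a minimal illustration; the JSON field names such as "dirty" are from
memory and may differ on your version, so treat them as assumptions):

#!/usr/bin/env python
# Minimal sketch: pull per-pool cache stats out of "ceph df detail" JSON.
# The "dirty"/"objects"/"bytes_used" field names are assumptions from
# memory; check the actual JSON on your cluster before relying on this.
import json
import subprocess


def cache_pool_stats(pool_name):
    raw = subprocess.check_output(
        ["ceph", "df", "detail", "--format", "json"])
    data = json.loads(raw.decode("utf-8"))
    for pool in data.get("pools", []):
        if pool.get("name") == pool_name:
            stats = pool.get("stats", {})
            return {
                "objects": stats.get("objects"),        # total objects
                "dirty": stats.get("dirty"),            # unflushed objects, if reported
                "bytes_used": stats.get("bytes_used"),  # bytes in the pool
            }
    return None


if __name__ == "__main__":
    # "hotpool" is a placeholder name for the cache pool
    print(cache_pool_stats("hotpool"))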


> And clearly the cache tier agent already has those numbers.
> Right now I'm guesstimating that most of my cache objects are actually
> clean (from VM reboots, only read, never written to), but I have no way
> to tell for sure.
> 
> Regards,
> 
> Christian
> --
> Christian Balzer        Network/Systems Engineer
> chibi@xxxxxxx   	Global OnLine Japan/Rakuten Communications
> http://www.gol.com/

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


