Hello,

Unlike what the subject may suggest, I'm mostly going to try and explain how things work with cache tiers, as far as I understand them. Something of a reference to point to. Of course if you spot something that's wrong or have additional information, by all means please do comment.

While the documentation in master now correctly warns that you HAVE to set target_max_bytes (the size of your cache pool) for any of the relative sizing bits to work, let's repeat it here since it wasn't mentioned there previously: without that value being set, none of the flushing or eviction will happen, resulting in blocked IOs when the cache gets full.

The other thing to remember about target_max_bytes (documented nowhere) is that this space calculation is done per PG. So if you have a 1024GB cache pool and target_max_bytes set accordingly (one of the most annoying things about Ceph is having to specify full bytes in most places instead of human friendly shortcuts like "1TB"), that quota is divided evenly among the PGs of the cache pool (with 1024 PGs, for example, a 1GB share per PG). Ceph (the cache tiering agent to be precise) will therefore think that the cache is 50% full as soon as just one PG has reached 512MB. In short, expect things to happen quite a bit before you reach the usage you think you specified with cache_target_dirty_ratio and cache_target_full_ratio. Annoying, but at least it fails safe.

I'm ignoring target_max_objects here, as it's the same thing for object counts instead of space. min_read_recency_for_promote and min_write_recency_for_promote I shall ignore for now as well, since I have no cluster to test them with.

Flush

Either way, once Ceph thinks you've reached the cache_target_dirty_ratio you specified, it copies dirty objects to the backing storage. If they never existed there before, they will be created (so keep that in mind if you see an increase in objects). This (additional object) is similar to tier promotion, when an existing object is copied from the base pool to the cache pool the first time it is accessed.

In versions after Hammer there is also cache_target_dirty_high_ratio, which specifies the point at which more aggressive flushing starts.

Note that flushing keeps objects in the cache. So that object you wrote to some days ago and have kept reading frequently ever since isn't just going away to the slower base pool.

Evict

Next is eviction. This is where things became a bit more muddled for me and I had to do some testing and staring at objects in PGs.

So your cache pool is now hitting the cache_target_full_ratio (or so the wonky space-per-PG algorithm thinks). Remember that all IO will stop once the cache pool gets 100% full, so you want this to happen at some safe, sane point before that. What that point is depends of course on the maximum write speed to your pool, how fast your cache can flush to the base pool, etc.

Now here is the fun part: clean objects (ones that have not been modified since they were promoted from the base pool or last flushed) are eligible for eviction. When reading about this the first time I thought it involved more moving of data from the cache pool to the base pool. However, since the object is "clean" (a copy exists on the base pool), it is simply zero'd when evicted, leaving an empty rados object in the cache pool and consequently releasing the space. So as far as IO and network traffic are concerned, your enemy is flushing, not eviction.
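For reference, all of the knobs above are plain pool settings. Something along these lines (the pool name "cachepool" and the actual values are just examples, pick your own):

  # target_max_bytes wants plain bytes, so for a 1024GB cache pool:
  # 1024 * 1024^3 = 1099511627776
  ceph osd pool set cachepool target_max_bytes 1099511627776

  # start flushing dirty objects at 40% (per PG, see above) and, on
  # versions that have it, flush more aggressively from 60% onwards
  ceph osd pool set cachepool cache_target_dirty_ratio 0.4
  ceph osd pool set cachepool cache_target_dirty_high_ratio 0.6

  # start evicting clean objects at 80% (again per PG), safely below 100%
  ceph osd pool set cachepool cache_target_full_ratio 0.8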
In clusters that have a clear usage pattern and idle times, a command to trigger flushes down to a specified ratio, with settable IO limits, would be most welcome. (hint-hint)

Lacking this for now, I've been pondering a cron job that sets cache_target_dirty_ratio from .7 (my current value) down to .6 (or more likely something less aggressive, like .65) for a few hours during the night and then back up again. This is based on our cache typically not growing by more than 2% per day. A rough sketch of what I have in mind is in the P.S. below.

Lastly we come to cache_min_flush_age and cache_min_evict_age. It is my understanding that in Hammer and later a truly full cache pool will cause these to be ignored to prevent IO deadlocks, correct?

The largest source of cache pollution for us is VM reboots (all those objects holding the kernel and other things only read at startup, never to be needed again for months), while on the other hand we have about 10k truly hot objects that are constantly being read/written. Lacking min_write_recency_for_promote for now, I've been thinking of setting cache_min_evict_age to several hours. Truly cold objects will still be subject to eviction, while even lukewarm ones get to stay. Note that for the objects that more or less belong in the cache we're using less than 15% of its capacity.

Christian
-- 
Christian Balzer        Network/Systems Engineer
chibi@xxxxxxx           Global OnLine Japan/Rakuten Communications
http://www.gol.com/
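P.S.: The cron job idea mentioned above would look roughly like this, purely illustrative (again, "cachepool" and the times/values are made up, and the host running it needs a ceph keyring with the appropriate caps):

  # /etc/cron.d/ceph-cache-flush (sketch, not deployed anywhere)
  # drop the dirty ratio at 02:00 so most flushing happens while things are idle
  0 2 * * *   root   ceph osd pool set cachepool cache_target_dirty_ratio 0.65
  # and raise it back to the normal value before the day starts
  0 6 * * *   root   ceph osd pool set cachepool cache_target_dirty_ratio 0.7

The cache_min_evict_age idea would simply be something like
"ceph osd pool set cachepool cache_min_evict_age 14400" (that's 4 hours, the value is in seconds).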