On Tue, 13 Jan 2015, Pavan Rallabhandi wrote:
> Hi,
>
> This is regarding cache pools and the impact of flush/evict on client
> IO latencies.
>
> I am seeing a direct impact on the client IO latencies (making them
> worse) when flush/evict is triggered on the cache pool. Under a
> constant ingress of IOs on the cache pool, the write performance is no
> better than without a cache pool, because it is limited to the speed
> at which objects can be flushed/evicted to the backend pool.

Yeah, this is always going to be true in general. It is a lot more work
to write into the cache, read it back, write it again into the base
pool, and then delete it from the cache than it is to write directly to
the base pool.

> The questions I have are:
>
> 1) When the flush/evict is in progress, are the writes on the cache
> pool blocked, either at PG or at object granularity? I do see a
> blocking flag honored per object context in
> ReplicatedPG::start_flush(), but most of the callers seem to set the
> flag to false.

Normally they are not blocked. The agent starts working (finding
objects to flush or evict) long before we hit the cutoff where it
starts blocking. Once it does hit that threshold, though, things can
get slow, because new object creates in the cache aren't allowed until
some eviction completes. You don't want to be in this situation. :)

In general, if you have a lot of data ingest, caching (at least in
firefly) isn't a terribly good idea. The exception would probably be
when you have a high skew toward recent data (say you are ingesting
market data, and do tons of analytics on the last 24 hours, but then
the data gets colder).

I can't tell whether you're in the situation where the cache pool is
full and the agent is flushing/evicting anything and everything and
writes are crawling (you should see a message in 'ceph health' when
this happens), or whether the agent is alive but working with low
effort and the impact is still high. If it's the latter, I'm not sure
yet what is going wrong... perhaps you can capture a few minutes of log
from one of your OSDs? (debug ms = 1, debug osd = 20)

> 2) Is there any mechanism (that I might have overlooked) to avoid this
> situation, by throttling the flush/evict operations on the fly? If
> not, shouldn't there be one?

Hmm, we could have a 'noagent' option (similar to noout, nobackfill,
noscrub, etc.) that lets the admin tell the system to stop tiering
movements, but I'm not sure that's what you're asking for...

sage
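
A hedged sketch of the per-pool thresholds the tiering agent works
against, for reference: the agent's behaviour (when it starts flushing
dirty objects, when it starts evicting, and ultimately when new writes
block) is derived from these settings on the cache pool. The pool name
'hot-pool' and the values are placeholders, not recommendations:

    # Tell the agent how large the cache pool is allowed to grow; the
    # ratio settings below are relative to these targets.
    ceph osd pool set hot-pool target_max_bytes 1099511627776
    ceph osd pool set hot-pool target_max_objects 1000000

    # Begin flushing dirty objects to the base pool at 40% of target.
    ceph osd pool set hot-pool cache_target_dirty_ratio 0.4

    # Begin evicting clean objects at 80% of target; lowering this
    # gives the agent more headroom before writes start to block.
    ceph osd pool set hot-pool cache_target_full_ratio 0.8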
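
For the log capture suggested above, one way to raise the debug levels
on a single OSD at runtime and restore them afterwards (osd.0 is just
an example daemon):

    # Raise messenger and OSD logging while the slow writes are
    # happening.
    ceph tell osd.0 injectargs '--debug-ms 1 --debug-osd 20'

    # ... reproduce the problem, collect the OSD log ...

    # Drop the levels back to the defaults.
    ceph tell osd.0 injectargs '--debug-ms 0/5 --debug-osd 0/5'

Running 'ceph health detail' during the ingest is also worth doing, to
see whether the cluster is reporting the cache pool as full.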
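
On the 'noagent' idea: such a flag does not exist and is only
hypothetical in this thread; the existing cluster-wide flags in that
family are toggled like this:

    # Existing flags of the kind mentioned above.
    ceph osd set noout
    ceph osd set nobackfill
    ceph osd set noscrub

    # Cleared again with:
    ceph osd unset noout
    ceph osd unset nobackfill
    ceph osd unset noscrub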