RE: cache tiering sessions at CDS


 



Sage,

Thank you for summarizing this. I still have some questions I'd like to clarify with you.

For wip-promote-forward, you said that it can't guarantee ordering. Does this problem exist in the proxy-and-promote approach as well? If not, how does the cache tier OSD preserve the ordering? As I understand it, a simple scenario could be: a read comes in and the cache tier OSD proxies it; the PG lock is released after the proxy request is sent to the base tier. A subsequent read request for the same object comes in, and by then the cache tier OSD happens to have finished the promotion of this object, so the 2nd read is served from the cache tier. Its reply may reach the client before the reply to the 1st read request.
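
To illustrate the race I mean, here is a tiny asyncio sketch; the latency numbers are made up and only serve to show how the 2nd reply can overtake the 1st:

import asyncio

# Made-up latencies, only to illustrate the race described above.
BASE_TIER_RTT = 0.050   # 1st read: proxied to the base tier
PROMOTE_TIME  = 0.020   # promotion completes while the 1st read is in flight
CACHE_READ    = 0.001   # 2nd read: served locally from the cache tier

replies = []            # order in which the client sees the replies

async def first_read():
    # The PG lock is dropped once the proxy op is sent, so nothing stops
    # later ops on the same object from being handled meanwhile.
    await asyncio.sleep(BASE_TIER_RTT)
    replies.append("read-1 (proxied to base)")

async def second_read():
    await asyncio.sleep(0.005)                      # arrives a bit after read-1
    await asyncio.sleep(PROMOTE_TIME + CACHE_READ)  # promote done, local read
    replies.append("read-2 (served from cache tier)")

async def main():
    await asyncio.gather(first_read(), second_read())
    print(replies)   # ['read-2 (served from cache tier)', 'read-1 (proxied to base)']

asyncio.run(main())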

If you don't already have someone working on this, we can definitely help. We can come back with the detailed implementation later.

-----Original Message-----
From: Sage Weil <sage@xxxxxxxxxxx>
Date: 2014-11-01 5:01 GMT+08:00
Subject: cache tiering sessions at CDS
To: ceph-devel@xxxxxxxxxxxxxxx


There were a pair of CDS sessions Wednesday on cache tiering that prompted a great discussion about the current performance problems we're seeing and ways to address them.  It was a long discussion but I'll do my best to summarize.  Please chime in if I miss anything or if you disagree with my conclusions!

The first session was about fine-grained promotion, i.e., promoting or storing only portions of an object in the cache.  Currently an object always exists in its entirety in the cache tier, but the latency from promotion can be expensive if the original write is small.

Sam and I generally agreed that there are advantages to doing this, but that the implementation will be quite complex.  There are also several significantly simpler improvements that can be made that address many (most?) of the problematic workload patterns.

-- Reads --

Currently we either forward a read (decline to promote) or block a read while we promote.  Doing more declining (e.g. promote on 2nd read) has been shown to help, but we should be able to do a lot better.
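
As a rough sketch of what "promote on 2nd read" could look like (illustrative only; the names and the recency window are made up, and this is not how the OSD actually tracks hotness):

import time

RECENCY_WINDOW = 60.0    # seconds an earlier miss stays "recent" (made up)

recent_read_miss = {}    # object name -> time of its most recent read miss

def should_promote_on_read(obj, now=None):
    """Return True if this read miss should trigger a promotion (2nd read)."""
    now = time.time() if now is None else now
    last = recent_read_miss.get(obj)
    recent_read_miss[obj] = now
    # Promote only if the object also missed recently, i.e. this is at
    # least the second read within the window; otherwise just forward it.
    return last is not None and (now - last) <= RECENCY_WINDOW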

The first step is wip-promote-forward, or something similar, which forwards the read *and* initiates a promote.  That way the original IO isn't delayed, only subsequent reads that arrive shortly after.

Second, even those subsequent reads need not wait for a promote: we can safely forward them too while promotion is in progress without breaking consistency from the client's perspective, as long as we preserve the order of reads and writes for each client.

9979 osd: cache: proxy reads (instead of redirect)
9980 osd: cache: proxy reads during promote
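
Very roughly, the read path for 9979/9980 behaves like the sketch below (illustrative only; read_local, proxy_to_base and start_promote stand in for the real OSD machinery):

from enum import Enum, auto

class ObjState(Enum):
    MISSING   = auto()   # object not in the cache tier
    PROMOTING = auto()   # copy-get from the base tier in flight
    PRESENT   = auto()   # fully promoted

def handle_read(obj, state, read_local, proxy_to_base, start_promote):
    if state is ObjState.PRESENT:
        return read_local(obj)
    if state is ObjState.MISSING:
        start_promote(obj)          # kick off the promote asynchronously...
        return proxy_to_base(obj)   # ...but don't make this read wait on it
    # PROMOTING: later reads are proxied too, so none of them block on the
    # promote; only a write to this object has to wait until it completes.
    return proxy_to_base(obj)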

Note: I believe there are some ordering problems with *redirecting* reads and then stopping (e.g., redirect, start promote, finish promote, read from cache ... the second read reply could reach the client before the first).  We may need to proxy in general?  :/

Anyway, proxying reads during promotion effectively makes the promotion asynchronous and transparent to the read workload, modulo the extra IO that the cache and base tiers will do (competition for network and disk IO).  I believe this will mitigate most of the impact on reads.

More importantly, it is at least as good as the more complicated proposal of satisfying the read from the intermediate promotion result before it is written into the cache tier.  In particular, I think the *only* time using the intermediate promote result is better is when the read falls entirely within the current in-flight copy-get operation (in flight to the base tier, or in the process of being written to the cache but still in memory).  Any other time (unaligned read, read arrives before the promote starts) it's better to proxy it.

Also, note that it is mainly small reads that we care about.  We expect large reads to be less frequent and, when they happen, to be generally okay with sending those to the base tier anyway.

-> Strategies that hide promotion cost are probably more useful than
strategies that promote less (or partial) object data.

-- Writes --

The situation for writes is a bit more complex.  First, if we add the ability to proxy writes to the backend, we give ourselves the ability to decide if/when to promote (currently we unconditionally promote on write).
This would allow a 'promote on 2nd write' type of behavior (similar to what we did for reads).
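
A minimal sketch of that policy (again illustrative; the names are made up and there is no decay of the counters here):

recent_write_miss = {}   # object name -> number of recent write misses

def handle_write_miss(obj, proxy_write, promote_then_write):
    count = recent_write_miss.get(obj, 0) + 1
    recent_write_miss[obj] = count
    if count >= 2:
        return promote_then_write(obj)   # looks hot: pull it into the cache
    return proxy_write(obj)              # first write: just proxy to base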

We talked about the possibility of combining the small write into the promotion's write of the full object into the cache.  Since these are currently pipelined, it is not clear that this will improve things very much.  Promoting only object metadata and writing a partial bit of data into the cache tier is the big win, but it's complex, and we should do all the simple things (like write proxying) first.

Finally, we talked about making a write-full on an object skip the data portion of the promote.  This is only moderately complex and seems doable.
However, it would be helpful to know how frequent write_full is in real workloads first.  Also, a write_full is arguably the type of operation where we might decline to promote at all, and simply proxy the write back to the base tier.
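
Something along these lines (a sketch only; the helpers are hypothetical stand-ins, and whether to promote at all is a policy knob):

def handle_write_full_miss(obj, want_promote, promote_metadata_only,
                           write_local, proxy_write):
    # The old contents are about to be replaced entirely, so copying the
    # data up from the base tier would be wasted work either way.
    if want_promote:
        promote_metadata_only(obj)   # skip the data portion of the promote
        return write_local(obj)      # then serve the write in the cache tier
    # Arguably the saner default here: decline to promote and just proxy.
    return proxy_write(obj)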

I think in the short term, the next step should be:

9981 osd: cache: proxy writes (instead of unconditionally promoting)

-- read-only cache --

Finally, we brought up the idea of a read-only cache tier:

 - reads would promote (or not) just as they do now
 - writes would invalidate (delete object from cache) and then forward/proxy

h/t to Dan Lambright for that suggestion.  Note that we already have a readonly cache mode; the delta here is how we handle the writes.

9982 osd: cache: make writes in readonly mode invalidate and then forward
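
A sketch of that write path (illustrative; proxy_write stands in for the real proxying machinery):

cache_contents = set()   # objects currently held in the read-only cache tier

def handle_write_in_readonly_mode(obj, proxy_write):
    # Invalidate first so a stale copy can never serve a later read...
    cache_contents.discard(obj)
    # ...then forward/proxy the write to the base tier.
    return proxy_write(obj)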


There was a lot of discussion here, so if you're interested you may want to check the pads or watch the videos.

http://pad.ceph.com/p/hammer-osd_tiering_promotion_unit
http://pad.ceph.com/p/hammer-osd_tiering_latencies_cache_tier_miss
http://youtu.be/7p8ZkOIJjUA
http://youtu.be/AGDOnJFffrc


sage