RE: cache tiering sessions at CDS

On Mon, 3 Nov 2014, Wang, Zhiqiang wrote:
> Sage,
> 
> Thank you for summarizing this. I still have some questions I'd like to 
> clarify with you.
> 
> For the wip-promote-forward, you said that it can't guarantee ordering. 
> Does this problem exist in the proxy and promote approach? If not, how 
> does the cache tier osd preserve the ordering? According to my 
> understanding, a simple scenario can be like this: a read comes in, the 
> cache tier osd proxies it. The pg lock is released after sending the 
> proxy request to the base tier. A subsequent read request for the same 
> object comes in. The cache tier osd happens to have just finished the 
> promotion for this object, so the 2nd read request is served at the 
> cache tier. Its reply may reach the client before the reply for the 1st 
> read request.

That is one scenario, although it is one we could solve by requeuing all 
in-progress proxy reads in the cache tier when promote completes.  The 
problem I was alluding to is with redirects, where we reply to the client 
and say "go look over there".  Then you can get

 - read 1 arrives at cache osd; redirect sent to client
 - osd initiates promote
 - promote completes
 - read 2 arrives at cache osd, replies
 - read 1 arrives at base osd, replies
 - client gets read 2 ack, then read 1 ack

or something similar.

I think the best answer to all of this is to not guarantee that reads 
will complete in order (only do that for writes).  We'll need to take a 
careful look at current users to make sure it is safe to relax that, 
though, and probably make it possible to request that reads *are* ordered 
(similar to how we can request that a particular read be ordered with 
respect to writes via the RWORDERED flag).

But in any case, with proxying, it's still possible to maintain the 
ordering with a bit of effort.
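
To make that concrete, here is a rough sketch of the requeue-on-promote 
idea.  It is purely illustrative (the types and names below are made up, 
not the actual OSD code): proxied reads for an object are kept in arrival 
order, and when the promote finishes anything still outstanding is 
re-dispatched in FIFO order, so a later read can never be answered before 
an earlier one.

  #include <cstdint>
  #include <deque>
  #include <functional>
  #include <memory>

  // Hypothetical stand-in for the real request type.
  struct ProxiedRead {
    uint64_t tid = 0;       // client transaction id
    bool replied = false;   // reply already sent via the proxied path?
  };

  struct ObjectReadState {
    // Proxied reads for one object, kept in arrival order.
    std::deque<std::shared_ptr<ProxiedRead>> in_flight;

    // Called when the promote for this object completes: anything not
    // yet answered is re-dispatched locally, strictly FIFO.
    void on_promote_complete(
        const std::function<void(std::shared_ptr<ProxiedRead>)>& redispatch) {
      while (!in_flight.empty()) {
        auto rd = in_flight.front();
        in_flight.pop_front();
        if (!rd->replied)
          redispatch(rd);   // serve from the now-promoted cache copy
      }
    }
  };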

> If you don't already have someone working on this, we can definitely 
> help with it. We can come back with the detailed implementation later.

That would be great!  Nobody is working on this yet.  I started the 
wip-promote-forward branch, but I think I didn't use the best approach 
because we should be divorcing the promote from the triggering request(s) 
entirely.

But it probably makes more sense to first implement the read 
proxying, as that is a prerequisite and useful on its own...

sage

> 
> -----Original Message-----
> From: Sage Weil <sage@xxxxxxxxxxx>
> Date: 2014-11-01 5:01 GMT+08:00
> Subject: cache tiering sessions at CDS
> To: ceph-devel@xxxxxxxxxxxxxxx
> 
> 
> There were a pair of CDS sessions Wednesday on cache tiering that prompted a great discussion about the current performance problems we're seeing and ways to address them.  It was a long discussion but I'll do my best to summarize.  Please chime in if I miss anything or if you disagree with my conclusions!
> 
> The first session was about fine-grained promotion.  I.e., promoting or storing only portions of an object in the cache.  Currently an object always exists in its entirety in the cache tier, but the latency from promotion can be expensive if the original write is small.
> 
> Sam and I generally agreed that there are advantages to doing this, but that the implementation will be quite complex.  There are also several improvements that can be made that address many (most?) of the problematic workload patterns and are significantly simpler.
> 
> -- Reads --
> 
> Currently we either forward a read (decline to promote) or block a read while we promote.  Declining more often (e.g., promoting only on the 2nd read) has been shown to help, but we should be able to do a lot better.
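> 
> Purely as an illustration of the 'promote on 2nd read' idea (the names 
> below are made up; a real implementation would track recency with 
> HitSets and age entries out):
> 
>   #include <string>
>   #include <unordered_map>
> 
>   // Illustrative only -- not the actual HitSet logic.
>   struct ReadPromotePolicy {
>     unsigned min_reads_for_promote = 2;  // "promote on 2nd read"
>     std::unordered_map<std::string, unsigned> recent_reads;
> 
>     // True if this read should also trigger a promotion; otherwise
>     // the read is simply declined/forwarded to the base tier.
>     bool should_promote(const std::string& oid) {
>       return ++recent_reads[oid] >= min_reads_for_promote;
>     }
>   };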
> 
> The first step is wip-promote-forward, or something similar, which forwards the read *and* initiates a promote.  Then the original IO isn't delayed, only subsequent reads that arrive shortly after.
> 
> Second, even those subsequent reads need not wait for a promote: we can safely forward them too while promotion is in progress without breaking consistency from the client's perspective, as long as we preserve the order of reads and writes for each client.
> 
> 9979 osd: cache: proxy reads (instead of redirect)
> 9980 osd: cache: proxy reads during promote
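> 
> Roughly, the read dispatch we're describing looks like this (a sketch 
> with made-up helper names, not the actual OSD code):
> 
>   // States a cached object can be in, as far as reads are concerned.
>   enum class ObjState { NOT_IN_CACHE, PROMOTING, IN_CACHE };
> 
>   template <typename Cache, typename Op>
>   void handle_cache_read(Cache& cache, const Op& op) {
>     switch (cache.state_of(op.oid)) {
>     case ObjState::IN_CACHE:
>       cache.serve_locally(op);        // normal hit
>       break;
>     case ObjState::PROMOTING:
>       cache.proxy_to_base(op);        // don't block behind the promote
>       break;
>     case ObjState::NOT_IN_CACHE:
>       cache.proxy_to_base(op);        // forward the read...
>       if (cache.should_promote(op.oid))
>         cache.start_promote(op.oid);  // ...and start an async promote
>       break;
>     }
>   }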
> 
> Note: I believe there are some ordering problems with *redirecting* reads and then stopping (e.g., redirect, start promote, finish promote, read from cache ... the second read reply could reach the client before the first).  We may need to proxy in general?  :/
> 
> Anyway, proxying reads during promotion effectively makes the promotion asynchronous and transparent to the read workload, modulo the extra IO that the cache and base tiers will do (competition for network and disk IO).  I believe this will mitigate most of the impact on reads.
> 
> More importantly, it is at least as good as the more complicated proposal of satisfying the read from the intermediate promotion result before it is written into the cache tier.  In particular, I think the *only* time using the intermediate promote result is better is when the read falls entirely within the current in-flight copy-get operation (in flight to the base tier, or in the process of being written to the cache but still in memory).  Any other time (unaligned read, read arrives before the promote starts) it's better to proxy it.
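> 
> To spell out that containment condition (illustrative only):
> 
>   #include <cstdint>
> 
>   struct Extent { uint64_t off; uint64_t len; };
> 
>   // True only when the read lies entirely inside the byte range the
>   // in-flight copy-get has already fetched from the base tier; in any
>   // other case (unaligned, partially covered, or arrived before the
>   // promote started) proxying is at least as good.
>   inline bool covered_by_copy_get(const Extent& read, const Extent& got) {
>     return read.off >= got.off &&
>            read.off + read.len <= got.off + got.len;
>   }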
> 
> Also, note that it is mainly small reads that we care about.  We expect large reads to be less frequent and, when they happen, to be generally okay with sending those to the base tier anyway.
> 
> -> Strategies that hide promotion cost are probably more useful than
> strategies that promote less (or partial) object data.
> 
> -- Writes --
> 
> The situation for writes is a bit more complex.  First, if we add the ability to proxy writes to the backend, we give ourselves the ability to decide if/when to promote (currently we unconditionally promote on write).
> This would allow a 'promote on 2nd write' type of behavior (similar to what we did for read).
> 
> We talked about the possibility of combining the small write into the promotion's write of the full object into the cache.  Since these are currently pipelined, it is not clear that this will improve things very much.  Promoting only object metadata and writing a partial bit of data into the cache tier is the big win, but it's complex, and we should do all the simple things (like write proxying) first.
> 
> Finally, we talked about making a write-full on an object skip the data portion of the promote.  This is only moderately complex and seems doable.
> However, it would be helpful to know how frequent write_full is in real workloads first.  Also, a write_full is arguably the type of operation where we might decline to promote at all, and simply proxy the write back to the base tier.
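> 
> Putting those options together, the write path decision might look 
> something like this (a sketch only; all names are made up):
> 
>   // Hypothetical names; just to spell out the decision tree.
>   enum class WriteAction {
>     PROXY,                  // forward to the base tier, no promotion
>     PROMOTE_THEN_WRITE,     // current behavior: promote, then write
>     PROMOTE_METADATA_ONLY,  // write_full: skip data we would overwrite
>   };
> 
>   template <typename Policy, typename Op>
>   WriteAction choose_write_action(Policy& policy, const Op& op) {
>     if (op.is_write_full())
>       return policy.promote_on_write_full
>                  ? WriteAction::PROMOTE_METADATA_ONLY
>                  : WriteAction::PROXY;
>     return policy.should_promote_write(op.oid)  // e.g. on 2nd write
>                ? WriteAction::PROMOTE_THEN_WRITE
>                : WriteAction::PROXY;
>   }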
> 
> I think in the short term, the next step should be:
> 
> 9981 osd: cache: proxy writes (instead of unconditionally promoting)
> 
> -- read-only cache --
> 
> Finally, we brought up the idea of a read-only cache tier:
> 
>  - reads would promote (or not) just as they do now
>  - writes would invalidate (delete object from cache) and then forward/proxy
> 
> h/t to Dan Lambright for that suggestion.  Note that we already have a readonly cache mode; the delta here is how we handle the writes.
> 
> 9982 osd: cache: make writes in readonly mode invalidate and then forward
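> 
> In other words, something like this (illustrative only, made-up helpers):
> 
>   template <typename Cache, typename Op>
>   void handle_write_in_readonly_mode(Cache& cache, const Op& op) {
>     if (cache.contains(op.oid))
>       cache.invalidate(op.oid);  // drop the possibly-stale cached copy
>     cache.proxy_to_base(op);     // the base tier stays authoritative
>   }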
> 
> 
> There was a lot of discussion here so if you're interested you may want to check the pads or watch the videos.
> 
> http://pad.ceph.com/p/hammer-osd_tiering_promotion_unit
> http://pad.ceph.com/p/hammer-osd_tiering_latencies_cache_tier_miss
> http://youtu.be/7p8ZkOIJjUA
> http://youtu.be/AGDOnJFffrc
> 
> 
> sage
