Re: Local SSD cache for ceph on each compute node.

Sage Weil <sage@xxxxxxxxxxxx> · Tue, 29 Mar 2016 08:42:36 -0400 (EDT)

On Tue, 29 Mar 2016, Ric Wheeler wrote:
> > However, if the write cache would be "flushed in-order" to Ceph 
> > you would just lose x seconds of data and, hopefully, not have a 
> > corrupted disk. That could be acceptable for some people. I was just 
> > stressing that that isn't the case.
> 
> This in-order assumption - speaking as someone who has a long history 
> in kernel file and storage - is the wrong assumption.
> 
> Don't think of the cache device and RBD as separate devices, once they 
> are configured like this, they are the same device from the point of 
> view of the file system (or whatever) that runs on top of them.
> 
> The cache and its caching policy can vary, but it is perfectly 
> reasonable to have data live only in that caching layer pretty much 
> forever. Local disk caches can also do this by the way :)

That's true for current caching devices like dm-cache, but does not need 
to be true--and I think that's what Robert is getting at.  The plan for 
RBD, for example, is to implement a client-side cache that has an ordered 
writeback policy, similar to the one described in this paper:

 https://www.usenix.org/conference/fast13/technical-sessions/presentation/koller

In that scenario, loss of the cache devices leaves you with a stale but 
crash-consistent image on the base device.
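
A minimal sketch of the ordered writeback idea, as a toy in-memory model 
and not the planned RBD implementation.  The class and method names are 
made up for illustration, and it assumes each epoch (the writes between 
two flush barriers) reaches the base device as a unit:

```python
class OrderedWritebackCache:
    """Toy model: writes are grouped into epochs closed by flush();
    epochs are written back to the base image strictly oldest-first."""

    def __init__(self):
        self.base = {}      # block -> data on the base device
        self.epochs = []    # closed epochs awaiting writeback, oldest first
        self.current = {}   # writes since the last flush barrier

    def write(self, block, data):
        self.current[block] = data

    def flush(self):
        # A flush barrier closes the current epoch (a checkpoint).
        if self.current:
            self.epochs.append(self.current)
            self.current = {}

    def writeback_one(self):
        # Apply the oldest epoch to the base. Assumption: this is atomic.
        if self.epochs:
            self.base.update(self.epochs.pop(0))

    def read(self, block):
        # Newest data wins: open epoch, then closed epochs, then base.
        for epoch in [self.current] + self.epochs[::-1]:
            if block in epoch:
                return epoch[block]
        return self.base.get(block)


cache = OrderedWritebackCache()
cache.write(0, "data-v1"); cache.flush()
cache.write(1, "commit-v1"); cache.flush()
cache.write(0, "data-v2"); cache.flush()

cache.writeback_one()   # data-v1 reaches the base
cache.writeback_one()   # commit-v1 reaches the base

# Losing the cache now leaves only what is on the base device.
image_after_cache_loss = dict(cache.base)
print(image_after_cache_loss)
```

The surviving image is missing data-v2 (stale), but it matches a state 
the client actually passed through, so it is crash-consistent.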

> The whole flushing order argument is really not relevant here. I could 
> "flush in order" after a minute, a week or a year. If the cache is large 
> enough, you might have zero data land on the backing store (even if the 
> destage policy flushes in order, as you suggest).

I think the assumption is you would place a bound on the amount of dirty 
data in the cache.  Since you need to checkpoint cache content (on, say, 
flush boundaries), that roughly bounds the size of the cache by the amount 
of data written, even if it is repeatedly scribbling over the same blocks.
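
A rough sketch of that bound, again as a toy model with hypothetical 
names (BoundedLogCache, max_dirty_bytes) and not any existing Ceph or 
dm-cache interface.  It assumes every flush boundary closes a checkpoint 
that has to be kept until it is written back, so rewriting one block ten 
times still costs ten blocks of cache space:

```python
class BoundedLogCache:
    """Toy model: checkpointed epochs with a cap on dirty bytes."""

    def __init__(self, max_dirty_bytes):
        self.max_dirty_bytes = max_dirty_bytes
        self.base = {}              # block -> bytes on the base device
        self.epochs = []            # closed checkpoints, oldest first
        self.current = {}           # writes since the last flush
        self.forced_writebacks = 0

    def dirty_bytes(self):
        return sum(len(data)
                   for epoch in self.epochs + [self.current]
                   for data in epoch.values())

    def write(self, block, data):
        # Overwrites inside one epoch coalesce; across epochs they do not.
        self.current[block] = data
        # Over the bound: destage the oldest checkpoints, in order.
        # (A real cache would also have to stall if the open epoch
        # alone exceeded the bound; this sketch ignores that case.)
        while self.dirty_bytes() > self.max_dirty_bytes and self.epochs:
            self.base.update(self.epochs.pop(0))
            self.forced_writebacks += 1

    def flush(self):
        if self.current:
            self.epochs.append(self.current)
            self.current = {}


BLOCK = 4096
cache = BoundedLogCache(max_dirty_bytes=4 * BLOCK)
for i in range(10):
    cache.write(7, bytes([i]) * BLOCK)   # same block every time
    cache.flush()

print(cache.forced_writebacks, cache.dirty_bytes(), cache.base[7][0])
```

With the bound in place the base image can only lag by a limited amount 
of written data, which addresses the "flush after a year" case.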

> That all said, the reason to use a write cache on top of client block 
> device - rbd or other - is to improve performance for the client.
> 
> Any time we make our failure domain require fully operating two devices 
> (the cache device and the original device), we increase the probability 
> of a non-recoverable failure.  In effect, the reliability of the storage 
> is at best as reliable as the least reliable part of the pair.

The goal is to add a new choice on the spectrum between (1) all writes are 
replicated across the cluster in order to get a consistent and up-to-date 
image when the client+cache fail, and (2) a writeback cache that gives you 
fast writes but leaves you with a corrupted (and stale) image after such a 
failure.  Ordered writeback (1.5) gives you low write latency and a stale 
but crash-consistent image.  I suspect this will be a sensible choice for 
a lot of different use cases and workloads.
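
To illustrate the difference between (2) and (1.5), here is a small 
sketch comparing what the base image can look like after the cache is 
lost.  The workload, the block names, and the "commit record implies its 
data block" invariant are all assumptions chosen for the example, and 
the unordered policy shown (destage in address order) is just one 
possible policy:

```python
def workload(n):
    """Write log: a data block, then a commit record that refers to it.
    A flush barrier is implied after every write."""
    log = []
    for i in range(1, n + 1):
        log.append((f"data{i}", f"payload{i}"))
        log.append(("commit", i))
    return log


def consistent(image):
    # Invariant: a commit record for i implies data block i is present.
    c = image.get("commit")
    return c is None or image.get(f"data{c}") == f"payload{c}"


def ordered_image(log, flushed):
    """Base image after the first `flushed` writes landed, in log order."""
    image = {}
    for block, value in log[:flushed]:
        image[block] = value
    return image


def unordered_image(log, flushed):
    """Base image after `flushed` dirty blocks landed in address order."""
    dirty = {}
    for block, value in log:
        dirty[block] = value          # cache keeps only the latest version
    image = {}
    for block in sorted(dirty)[:flushed]:
        image[block] = dirty[block]
    return image


log = workload(3)

# Ordered writeback: every possible point of cache loss is consistent.
ordered_ok = all(consistent(ordered_image(log, k))
                 for k in range(len(log) + 1))

# Unordered writeback: the commit record can land before its data.
unordered_ok = consistent(unordered_image(log, 1))

print(ordered_ok, unordered_ok)
```

In the unordered case the base image ends up with a commit record that 
points at data which never arrived, which is the corrupted-and-stale 
outcome of option (2).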

Is anything like this on the dm-cache roadmap as well?  It's probably less 
useful when the cache device lives in the same host (compared to a 
client/cluster arrangement more typical of RBD where a client host failure 
takes out the cache device but not the base image), but it might still be 
worth considering.

sage