On Tue, 29 Mar 2016, Ric Wheeler wrote:
> > However, if the write cache would be "flushed in-order" to Ceph
> > you would just lose x seconds of data and, hopefully, not have a
> > corrupted disk. That could be acceptable for some people. I was just
> > stressing that that isn't the case.
>
> This in-order assumption - speaking as someone who has a long history
> in kernel file and storage - is the wrong assumption.
>
> Don't think of the cache device and RBD as separate devices. Once they
> are configured like this, they are the same device from the point of
> view of the file system (or whatever) that runs on top of them.
>
> The cache and its caching policy can vary, but it is perfectly
> reasonable to have data live only in that caching layer pretty much
> forever. Local disk caches can also do this by the way :)
That's true for current caching devices like dm-cache, but does not need
to be true--and I think that's what Robert is getting at. The plan for
RBD, for example, is to implement a client-side cache that has an ordered
writeback policy, similar to the one described in this paper:
https://www.usenix.org/conference/fast13/technical-sessions/presentation/koller
In that scenario, loss of the cache devices leaves you with a stale but
crash-consistent image on the base device.
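
To make the difference concrete, here is a toy sketch, not RBD or dm-cache code. It assumes the base image is a dict of block -> contents and that "crash consistent" means the image equals the base plus some prefix of the write history. The function names, the block layout, and the address-sorted "unordered" policy are all made up for illustration.

```python
def apply_writes(base, writes):
    """Return a copy of `base` with `writes` (block, data) applied in sequence."""
    image = dict(base)
    for block, data in writes:
        image[block] = data
    return image


def crash_consistent(image, base, history):
    """True if `image` equals `base` plus some prefix of the write history."""
    return any(
        image == apply_writes(base, history[:n])
        for n in range(len(history) + 1)
    )


def destage(base, history, flushed, ordered):
    """Write back `flushed` cached writes, then lose the cache device.

    ordered=True  : writes reach the base in the order they were issued.
    ordered=False : writes are sorted by block address first (a hypothetical
                    elevator-style policy that ignores issue order).
    """
    pending = list(history)
    if not ordered:
        pending.sort(key=lambda write: write[0])  # stable sort by block number
    return apply_writes(base, pending[:flushed])


# Hypothetical journal-like workload: data block first, then a commit record.
base = {0: "commit-v0"}
history = [(5, "data-v1"), (0, "commit-v1"), (6, "data-v2"), (0, "commit-v2")]

stale_ok = destage(base, history, flushed=2, ordered=True)
stale_bad = destage(base, history, flushed=2, ordered=False)
```

With the ordered policy the surviving image is old but matches a point the file system actually passed through. With the address-sorted policy the commit record can land without the data it refers to.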
> The whole flushing order argument is really not relevant here. I could
> "flush in order" after a minute, a week or a year. If the cache is large
> enough, you might have zero data land on the backing store (even if the
> destage policy works in order, as you suggest).
I think the assumption is you would place a bound on the amount of dirty
data in the cache. Since you need to checkpoint cache content (on, say,
flush boundaries), that roughly bounds the size of the cache by the amount
of data written, even if it is repeatedly scribbling over the same blocks.
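
A minimal sketch of that bound, under assumptions that are not from any real implementation: writes are grouped into epochs sealed at flush boundaries, each sealed epoch keeps its own copy of a block, and the oldest epoch is destaged to the base whenever dirty bytes exceed a limit. The class name, the `max_dirty_bytes` parameter, and the rule of force-sealing the open epoch when nothing else can be destaged are hypothetical.

```python
from collections import deque


class BoundedOrderedCache:
    """Toy ordered-writeback cache with a cap on dirty data."""

    def __init__(self, base, max_dirty_bytes):
        self.base = base                    # dict: block -> bytes (base image)
        self.max_dirty_bytes = max_dirty_bytes
        self.sealed = deque()               # epochs closed by a flush, oldest first
        self.current = {}                   # writes since the last flush
        self.dirty_bytes = 0

    def write(self, block, data):
        # Overwrites inside the open epoch coalesce; an overwrite of a block
        # held in a sealed epoch still consumes new cache space.
        old = self.current.get(block)
        if old is not None:
            self.dirty_bytes -= len(old)
        self.current[block] = data
        self.dirty_bytes += len(data)
        self._enforce_bound()

    def flush(self):
        # A flush boundary acts as a checkpoint: seal the open epoch.
        if self.current:
            self.sealed.append(self.current)
            self.current = {}

    def _enforce_bound(self):
        while self.dirty_bytes > self.max_dirty_bytes:
            if not self.sealed:
                self.flush()                # assumption: force a checkpoint
            if not self.sealed:
                break
            epoch = self.sealed.popleft()   # oldest epoch goes first
            self.base.update(epoch)
            self.dirty_bytes -= sum(len(d) for d in epoch.values())
```

Even if a client rewrites the same block over and over with a flush after each write, every sealed copy counts against the limit, so older versions keep reaching the base image instead of living in the cache indefinitely.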
> That all said, the reason to use a write cache on top of client block
> device - rbd or other - is to improve performance for the client.
>
> Any time we make our failure domain require two fully operating devices
> (the cache device and the original device), we increase the probability
> of a non-recoverable failure. In effect, the storage is at best as
> reliable as the least reliable part of the pair.
The goal is to add a new choice on the spectrum between (1) replicating all
writes across the cluster, which gives you a consistent and up-to-date
image when the client+cache fail, and (2) a writeback cache that gives you
fast writes but leaves you with a corrupted (and stale) image after such a
failure. Ordered writeback (1.5) gives you low write latency and a stale
but crash-consistent image. I suspect this will be a sensible choice for
a lot of different use cases and workloads.
Is anything like this on the dm-cache roadmap as well? It's probably less
useful when the cache device lives in the same host (compared to a
client/cluster arrangement more typical of RBD where a client host failure
takes out the cache device but not the base image), but it might still be
worth considering.
sage
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com