On 03/29/2016 03:42 PM, Sage Weil wrote:
On Tue, 29 Mar 2016, Ric Wheeler wrote:
However, if the write cache would be "flushed in order" to Ceph
you would just lose x seconds of data and, hopefully, not have a
corrupted disk. That could be acceptable for some people. I was just
stressing that that isn't the case.
This in-order assumption - speaking as someone with a long history
in kernel file and storage - is the wrong one.
Don't think of the cache device and RBD as separate devices; once they
are configured like this, they are the same device from the point of
view of the file system (or whatever) that runs on top of them.
The cache and its caching policy can vary, but it is perfectly
reasonable to have data live only in that caching layer pretty much
forever. Local disk caches can also do this by the way :)
That's true for current caching devices like dm-cache, but does not need
to be true--and I think that's what Robert is getting at. The plan for
RBD, for example, is to implement a client-side cache that has an ordered
writeback policy, similar to the one described in this paper:
https://www.usenix.org/conference/fast13/technical-sessions/presentation/koller
In that scenario, loss of the cache device leaves you with a stale but
crash-consistent image on the base device.
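Roughly, a minimal sketch of that scheme (in Python; the names and structure
are made up for illustration - this is not the actual RBD or dm-cache code,
and a real implementation would need a journal to apply each epoch atomically):

import collections

class OrderedWritebackCache:
    def __init__(self, backing):
        self.backing = backing              # dict-like stand-in for the base image
        self.cache = {}                     # newest data, served to reads
        self.epochs = collections.deque()   # closed epochs waiting to destage, oldest first
        self.current = {}                   # writes accepted since the last flush

    def write(self, block, data):
        self.cache[block] = data
        self.current[block] = data          # dirty locally, not yet on the base image

    def read(self, block):
        return self.cache.get(block, self.backing.get(block))

    def flush(self):
        # A flush from above (fsync, a flush request) closes the current epoch.
        if self.current:
            self.epochs.append(self.current)
            self.current = {}

    def destage_one_epoch(self):
        # Epochs are applied to the base image strictly in order, one at a
        # time; here the whole epoch lands in a single step, which is exactly
        # the atomicity a real implementation has to provide (e.g. via a journal).
        if self.epochs:
            self.backing.update(self.epochs.popleft())

With something like that, losing the cache device at any moment leaves the base
image matching some earlier flush point: stale, but never a mix of data from
both sides of a flush.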
Certainly if we design a distributed system aware caching layer, things can be
different.
I think that the trade-offs of using a cache local to the client are certainly
a bit confusing for normal users, but such caches are pretty popular these days
even given the limits.
Enterprise storage systems (big EMC/Hitachi/etc arrays) have effectively been
implemented as distributed systems for quite a long time, and these server-local
caches are routinely used by clients for their virtual LUNs.
Adding a performance-boosting caching layer that uses a local SSD (especially
for a virtual guest) has, I think, a lot of upsides even if it does not solve
the problem for the migrating-device case.
The whole flushing order argument is really not relevant here. I could
"flush in order" after a minute, a week or a year. If the cache is large
enough, you might have zero data land on the backing store (even if the
destage policy destages in order, as you suggest).
I think the assumption is that you would place a bound on the amount of dirty
data in the cache. Since you need to checkpoint the cache contents (on, say,
flush boundaries), that roughly bounds the size of the cache by the amount
of data written, even if the workload is repeatedly scribbling over the same blocks.
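As a back-of-the-envelope illustration of that bound (the numbers below are
assumptions, not measurements):

# Unique data written between flushes, and how many closed epochs we allow
# to queue before the cache starts throttling incoming writes.
epoch_write_set_mb = 64
retained_epochs = 8
max_dirty_mb = epoch_write_set_mb * retained_epochs
print(max_dirty_mb, "MB of dirty data, worst case")   # 512

Repeated overwrites of the same block within one epoch cost only one copy, so
the dirty data grows with what is written per epoch, not with raw write traffic.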
Keep in mind that the device mapper device is the device you see at the client -
when you flush, you are flushing to it. It is designed as a cache for a local
device; the fact that Ceph RBD is underneath it (and has a distributed backend) is
not really of interest, in the current generation at least.
It is perfectly legal, for example, for data to live only in the SSD and never
land on the backing device.
How we bound and manage the cache and the life cycle of the data is something
that Joe and the device mapper people have been actively working on.
I don't think that ordering alone is enough for any local linux file system.
The promise made from the storage layer up to the kernel file system stack is
basically that any transaction we commit (using synchronize_cache or similar
mechanisms) is durable across a power failure. We make no assumptions about
ordering with regard to other writes; when we care, we flush the world with a
whole-device sync, or use sync plus FUA (the gory details of the multiple
incarnations of this live in the kernel Documentation subtree in
block/writeback_cache_control.txt).
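From the application side the contract looks like this (a small illustration in
Python; the path is made up):

import os

path = "/tmp/ordering-example"
fd = os.open(path, os.O_WRONLY | os.O_CREAT, 0o644)
os.write(fd, b"transaction record\n")
os.fsync(fd)                       # durability point: must survive power loss
os.write(fd, b"commit marker\n")   # no promise yet: may or may not be on media after a crash
os.close(fd)

Nothing written after the fsync() is ordered or durable with respect to anything
else until the next flush.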
That all said, the reason to use a write cache on top of client block
device - rbd or other - is to improve performance for the client.
Any time we make our failure domain require two fully operating devices
(the cache device and the original device), we increase the probability
of a non-recoverable failure. In effect, the storage is at best as
reliable as the least reliable part of the pair.
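A quick illustration of that arithmetic (the survival probabilities below are
assumed, purely for illustration):

p_cache = 0.99       # local SSD survives the period
p_base = 0.9999      # replicated RBD image survives the period
p_both = p_cache * p_base
print(p_both)        # ~0.9899 - worse than either device on its own

That is the price of requiring both halves of the pair to be intact in order to
read the image.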
The goal is to add a new choice on the spectrum between (1) all writes are
replicated across the cluster in order to get a consistent and up-to-date
image when the client+cache fail, and (2) a writeback that gives you fast
writes but leaves you with a corrupted (and stale) image after such a
failure. Ordered writeback (1.5) gives you low write latency and a stale
but crash consistent image. I suspect this will be a sensible choice for
a lot of different use cases and workloads.
Writeback mode with dm-cache does not give you corruption after a reboot or
power failure, etc. It only gives you problems when the physical card dies or we
try to use the underlying device without the caching device present.
The thing that makes dm-cache different from what this paper was designed for, I
think, is that dm-cache is durable across power outages. As long as we come
back with the same stacked device (rbd + dm-cache layered device), it already
provides crash consistent images for file systems and applications that use
fsync() and friends correctly.
What it does not address is the death of the local physical media on the client
or trying to migrate the device to another host (and rebuild a cache there).
Is anything like this on the dm-cache roadmap as well? It's probably less
useful when the cache device lives in the same host (compared to a
client/cluster arrangement more typical of RBD where a client host failure
takes out the cache device but not the base image), but it might still be
worth considering.
I don't know of any plans on the device mapper team to tackle distributed
caching, but we could certainly bring it up with Joe and company (hint: easier
to do if people did not keep dropping them from the replies to the thread :))
Regards,
Ric