> -----Original Message-----
> From: ceph-users [mailto:ceph-users-bounces@xxxxxxxxxxxxxx] On Behalf Of Ric Wheeler
> Sent: 29 March 2016 14:07
> To: Sage Weil <sage@xxxxxxxxxxxx>
> Cc: ceph-users@xxxxxxxxxxxxxx
> Subject: Re: Local SSD cache for ceph on each compute node.
>
> On 03/29/2016 03:42 PM, Sage Weil wrote:
> > On Tue, 29 Mar 2016, Ric Wheeler wrote:
> >>> However, if the write cache would be "flushed in-order" to Ceph, you
> >>> would just lose x seconds of data and, hopefully, not have a corrupted
> >>> disk. That could be acceptable for some people. I was just stressing
> >>> that that isn't the case.
> >> This in-order assumption - speaking as someone who has a long history
> >> in kernel file and storage - is the wrong assumption.
> >>
> >> Don't think of the cache device and RBD as separate devices; once they
> >> are configured like this, they are the same device from the point of
> >> view of the file system (or whatever) that runs on top of them.
> >>
> >> The cache and its caching policy can vary, but it is perfectly
> >> reasonable to have data live only in that caching layer pretty much
> >> forever. Local disk caches can also do this, by the way :)
> > That's true for current caching devices like dm-cache, but it does not
> > need to be true - and I think that's what Robert is getting at. The
> > plan for RBD, for example, is to implement a client-side cache that has
> > an ordered writeback policy, similar to the one described in this paper:
> >
> > https://www.usenix.org/conference/fast13/technical-sessions/presentation/koller
> >
> > In that scenario, loss of the cache device leaves you with a stale but
> > crash-consistent image on the base device.
>
> Certainly if we design a caching layer that is aware of the distributed
> system, things can be different.
>
> I think that the trade-offs of using a cache local to the client are
> certainly a bit confusing for normal users, but such caches are pretty
> popular these days even given the limits.
>
> Enterprise storage systems (big EMC/Hitachi/etc. arrays) have effectively
> been implemented as distributed systems for quite a long time, and these
> server-local caches are routinely used by clients for their virtual LUNs.
>
> Adding a performance-boosting caching layer (especially for a virtual
> guest) that uses a local SSD has a lot of upsides, I think, even if it
> does not solve the problem for the migrating-device case.
>
> >> The whole flushing-order argument is really not relevant here. I could
> >> "flush in order" after a minute, a week or a year. If the cache is
> >> large enough, you might have zero data land on the backing store (even
> >> if the destage policy does it in order, as you suggest).
> > I think the assumption is you would place a bound on the amount of dirty
> > data in the cache. Since you need to checkpoint cache content (on, say,
> > flush boundaries), that roughly bounds the size of the cache by the
> > amount of data written, even if it is repeatedly scribbling over the
> > same blocks.
>
> Keep in mind that the device mapper device is the device you see at the
> client - when you flush, you are flushing to it. It is designed as a
> cache for a local device; the fact that Ceph RBD is under it (and has a
> distributed backend) is not really of interest in the current generation,
> at least.
>
> It is perfectly legal to have data live only in the SSD, for example, and
> never land in the backing device.
>
> How we bound and manage the cache and the life cycle of the data is
> something that Joe and the device mapper people have been actively
> working on.
>
> I don't think that ordering alone is enough for any local Linux file
> system.
>
> The promises made from the storage layer up to the kernel file system
> stack are basically that any transaction we commit (using
> synchronize_cache or similar mechanisms) is durable across a power
> failure.
We don't have assumptions on
> ordering with regard to other writes (i.e., when we care, we flush the
> world with a whole-device sync, or sync and FUA); the gory details of the
> multiple incarnations of this live in the kernel Documentation subtree in
> block/writeback_cache_control.txt.
>
> >> That all said, the reason to use a write cache on top of a client
> >> block device - rbd or other - is to improve performance for the
> >> client.
> >>
> >> Any time we make our failure domain require two fully operating
> >> devices (the cache device and the original device), we increase the
> >> probability of a non-recoverable failure. In effect, the reliability
> >> of the storage is at best as reliable as the least reliable part of
> >> the pair.
> > The goal is to add a new choice on the spectrum between (1) all writes
> > are replicated across the cluster in order to get a consistent and
> > up-to-date image when the client+cache fail, and (2) a writeback cache
> > that gives you fast writes but leaves you with a corrupted (and stale)
> > image after such a failure. Ordered writeback (1.5) gives you low write
> > latency and a stale but crash-consistent image. I suspect this will be
> > a sensible choice for a lot of different use cases and workloads.
>
> Writeback mode with dm-cache does not give you corruption after a reboot
> or power failure, etc. It only gives you problems when the physical card
> dies or we try to use the underlying device without the caching device
> present.
>
> The thing that makes dm-cache different from what this paper was designed
> for, I think, is that dm-cache is durable across power outages. As long
> as we come back with the same stacked device (rbd + dm-cache layered
> device), it already provides crash-consistent images for file systems and
> applications that use fsync() and friends correctly.
>
> What it does not address is the death of the local physical media on the
> client, or trying to migrate the device to another host (and rebuild a
> cache there).
>
> > Is anything like this on the dm-cache roadmap as well? It's probably
> > less useful when the cache device lives in the same host (compared to a
> > client/cluster arrangement more typical of RBD, where a client host
> > failure takes out the cache device but not the base image), but it
> > might still be worth considering.
>
> I don't know of any plans on the device mapper team to tackle distributed
> caching, but we could certainly bring it up with Joe and company (hint:
> easier to do if people did not keep dropping them from the replies to the
> thread :))

One thing I picked up on when looking at dm-cache for caching RBDs is that
it wasn't really designed to be used as a writeback cache for new writes,
in the way you would expect a traditional writeback cache to work. All the
policies seem to be designed around the idea that writes go to the cache
only if the block is already in the cache (through reads) or it's hot
enough to be promoted. Although there did seem to be some tunables to
alter this behaviour, posts on the mailing list suggested this wasn't how
it was designed to be used. I'm not sure whether this has been addressed
since I last looked at it, though. Depending on whether you are trying to
accelerate all writes or just your "hot" blocks, this may or may not
matter. Even <1GB local caches can make a huge difference to sync writes.

> Regards,
>
> Ric
>
> _______________________________________________
> ceph-users mailing list
> ceph-users@xxxxxxxxxxxxxx
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com