On Thu, Aug 14, 2014 at 12:40 PM, Robert LeBlanc <robert at leblancnet.us> wrote:
> We are looking to deploy Ceph in our environment and I have some
> musings that I would like some feedback on. There are concerns about
> scaling a single Ceph instance to the PBs of size we would use, so the
> idea is to start small, like one Ceph cluster per rack or two.

There are many groups running clusters >1PB, but whatever makes you
comfortable. There is a bit more of a learning curve once you reach a
certain scale than there is with smaller installations.

> Then, as we feel more comfortable with it, expand/combine clusters into
> larger systems. I'm not sure that it is possible to combine discrete
> Ceph clusters. It also seems to make sense to build a CRUSH map that
> defines regions, data centers, sections, rows, racks, and hosts now so
> that there is less data migration later, but I'm not sure how a merge
> would work.

Yeah, there's no merging of Ceph clusters and I don't think there ever
will be. Setting up the CRUSH maps this way to start, and only having a
single entry for most of the levels, would work just fine though.

> I've also been toying with the idea of an SSD journal per node versus
> an SSD cache tier pool versus lots of RAM for cache. Based on the
> performance webinar today, it seems that cache misses in the cache pool
> cause a lot of writing to the cache pool and severely degrade
> performance. I certainly like the idea of a heat map; that way a single
> read of an entire VM (backup, rsync) won't kill the cache pool.

Yeah, there is very little real-world Ceph experience with cache pools,
and there's a lot of experience with an SSD journal + hard drive backing
store; I'd start with that.

> I've also been bouncing around the idea of getting data locality by
> configuring the CRUSH map to keep two of the three replicas within the
> same row and the third replica just somewhere in the data center. Based
> on a conversation on IRC a couple of days ago, it seems that this could
> work very well if min_size is 2. But the documentation and the
> objective of Ceph seem to indicate that min_size only applies in
> degraded situations. During normal operation a write would have to be
> acknowledged by all three replicas before being returned to the client,
> otherwise it would be eventually consistent and not strongly consistent
> (I do like the idea of eventual consistency for replication as long as
> we can be strongly consistent in some form at the same time, like 2 out
> of 3).

Yeah, there's no async replication at all for generic workloads. You can
do the "two in my rack and one in a different rack" thing just fine,
although it's a little tricky to set up. (There are email threads about
this that hopefully you can find; I've been part of one of them, and
there's a rough sketch of the rule in the P.S. below.) The min_size is
all about preserving a minimum resiliency for *every* write (if a PG's
replication is degraded but not yet repaired); if you had a 2+1 setup
then a min_size of 2 would just make sure there are at least two copies
somewhere (but not that they're in different racks or whatever).
-Greg
Software Engineer #42 @ http://inktank.com | http://ceph.com
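
P.S. In case it's useful, here's a rough sketch of what I mean by
building out the full hierarchy now while only populating one bucket per
level. The bucket and host names are just placeholders, so adjust them
to match your layout:

    # one datacenter, one row, one rack for now; more can be added
    # later without changing the structure of the map
    ceph osd crush add-bucket dc1 datacenter
    ceph osd crush add-bucket row1 row
    ceph osd crush add-bucket rack1 rack
    ceph osd crush move dc1 root=default
    ceph osd crush move row1 datacenter=dc1
    ceph osd crush move rack1 row=row1
    # hosts (and their OSDs) then get placed under the rack as they
    # come up
    ceph osd crush move cephhost1 rack=rack1

Later expansion is then mostly just more add-bucket/move calls rather
than a restructuring of the map.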
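
For the SSD journal route, the simplest form is just pointing each OSD's
journal at a partition on the SSD, e.g. in ceph.conf (the device paths
below are made up; use whatever is stable on your systems):

    [osd.0]
        osd journal = /dev/disk/by-partlabel/journal-osd0
    [osd.1]
        osd journal = /dev/disk/by-partlabel/journal-osd1

Just keep in mind that piling too many OSD journals onto one SSD moves
the bottleneck to the SSD.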
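
And here's the flavor of rule I mean for the "two in my rack and one in
a different rack" placement, using multiple take/emit steps. Note that
rack1/rack2 are placeholders and this version hard-codes the racks,
which is part of what makes it a little tricky (you end up wanting a
rule per locality, or a cleverer hierarchy), so treat it as a sketch and
run it through crushtool --test before trusting it:

    rule two_local_one_remote {
            ruleset 1
            type replicated
            # min_size/max_size here are the pool sizes this rule
            # applies to, not the pool's min_size setting
            min_size 3
            max_size 3
            # two copies on different hosts in the local rack
            step take rack1
            step chooseleaf firstn 2 type host
            step emit
            # one copy on a host in the other rack
            step take rack2
            step chooseleaf firstn 1 type host
            step emit
    }

The pool then gets pointed at that rule, with size 3 and min_size 2:

    ceph osd pool set <pool> crush_ruleset 1
    ceph osd pool set <pool> size 3
    ceph osd pool set <pool> min_size 2

But as above, min_size only changes behavior when a PG is degraded;
during normal operation the write still has to be acknowledged by all
three replicas before the client gets its reply.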