Musings

Greg, thanks for the reply, please see in-line.


On Tue, Aug 19, 2014 at 11:34 AM, Gregory Farnum <greg at inktank.com> wrote:

>
> There are many groups running clusters >1PB, but whatever makes you
> comfortable. There is a bit more of a learning curve once you reach a
> certain scale than there is with smaller installations.
>

What do you find to be the most difficult issues at large scale? It may
help ease some of our concerns if we know what to expect.

> Yeah, there's no merging of Ceph clusters and I don't think there ever
> will be. Setting up the CRUSH maps this way to start, and only having
> a single entry for most of the levels, would work just fine though.
>

Thanks for confirming my suspicions. If we start with a well-designed CRUSH
map, we can probably migrate the data outside of Ceph and just grow the one
system; as the others empty, we can reformat them and bring them in.
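
To make sure I'm reading the "single entry for most of the levels"
advice right, something like this is what I had in mind for building
the full hierarchy up front (a rough sketch; the bucket names are
placeholders and it assumes the default CRUSH type list):

    # Build the full depth now, even with one entry per level, so that
    # growing later means adding buckets rather than restructuring.
    ceph osd crush add-bucket dc1 datacenter
    ceph osd crush move dc1 root=default
    ceph osd crush add-bucket rack1 rack
    ceph osd crush move rack1 datacenter=dc1
    ceph osd crush add-bucket host1 host
    ceph osd crush move host1 rack=rack1

OSDs would then go under host1 as usual, and additional hosts, racks,
and datacenters can be slotted in later without touching the rules.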

> Yeah, there is very little real-world Ceph experience with cache pools, and
> there's a lot working with an SSD journal + hard drive backing store;
> I'd start with that.
>

Other thoughts are using something like bcache or dm-cache on each OSD.
bcache is tempting because a single SSD device can serve multiple disks,
whereas dm-cache needs a separate SSD device/partition for each disk
(plus one for metadata). I plan on testing this unless someone says that
it is absolutely not worth the time.
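
For the bcache test, the rough plan is one SSD formatted as a cache set
and attached to a couple of spinning backing disks, with the OSDs then
created on the resulting bcache devices. This is only a sketch and the
device names are examples:

    # Format the SSD as the cache device and each data disk as backing
    make-bcache -C /dev/sdf
    make-bcache -B /dev/sdb
    make-bcache -B /dev/sdc
    # Once udev has registered the devices, attach both backing disks
    # to the SSD's cache set
    cset=$(bcache-super-show /dev/sdf | awk '/cset.uuid/ {print $2}')
    echo $cset > /sys/block/bcache0/bcache/attach
    echo $cset > /sys/block/bcache1/bcache/attach
    # OSDs would then be built on /dev/bcache0 and /dev/bcache1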


> Yeah, no async replication at all for generic workloads. You can do
> the "2 my rack and one in a different rack" thing just fine, although
> it's a little tricky to set up. (There are email threads about this
> that hopefully you can find; I've been part of one of them.) The
> min_size is all about preserving a minimum resiliency of *every* write
> (if a PG's replication is degraded but not yet repaired); if you had a
> 2+1 setup then min_size of 2 would just make sure there are at least
> two copies somewhere (but not that they're in different racks or
> whatever).
>

The current discussion in the office is whether, when the cluster (2+1) is
HEALTHY, the write returns after two of the OSDs (the primary and one
replica) complete it, or only after all three have completed it. We are
planning to do some testing on this as well if a clear answer can't be
found.
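
For the rack placement itself, this is the sort of rule I was planning
to test, based on my reading of those earlier threads. It is only a
sketch (the rule name is made up and it assumes rack buckets already
exist under the default root):

    # ceph osd getcrushmap -o map.bin && crushtool -d map.bin -o map.txt
    rule rack_2plus1 {
            ruleset 1
            type replicated
            min_size 2
            max_size 3
            step take default
            step choose firstn 2 type rack
            step chooseleaf firstn 2 type host
            step emit
    }
    # crushtool -c map.txt -o map.new && ceph osd setcrushmap -i map.new

As I understand it, with a pool size of 3 this picks two racks and two
hosts in each, and only the first three OSDs get used, which should
land two copies in one rack and one in the other.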

Thank you,
Robert LeBlanc