Thanks, your responses have been helpful.

On Tue, Aug 19, 2014 at 1:48 PM, Gregory Farnum <greg at inktank.com> wrote:
> On Tue, Aug 19, 2014 at 11:18 AM, Robert LeBlanc <robert at leblancnet.us>
> wrote:
> > Greg, thanks for the reply, please see in-line.
> >
> > On Tue, Aug 19, 2014 at 11:34 AM, Gregory Farnum <greg at inktank.com>
> > wrote:
> >>
> >> There are many groups running clusters >1PB, but whatever makes you
> >> comfortable. There is a bit more of a learning curve once you reach a
> >> certain scale than there is with smaller installations.
> >
> > What do you find to be the most difficult issues at large scale? It may
> > help ease some of the concerns if we know what we can expect.
>
> Well, I'm a developer, not a maintainer, so I'm probably the wrong person
> to ask about what surprises people. But in general it's stuff like:
> 1) Tunable settings matter more.
> 2) Behavior that was unfortunate but left the cluster alive in a small
> cluster (e.g., you have a bunch of slow OSDs that keep flapping) could
> turn into a data non-availability event in a large one (because with that
> many more OSDs misbehaving it overwhelms the monitors or something).
> 3) Resource consumption limits start popping up (e.g., fd and pid limits
> need to be increased).
>
> Things like that. These are generally a matter of admin education at this
> scale (the code issues are fairly well sorted out by now, although there
> were plenty of those to be found on the first multi-petabyte-scale
> cluster).
>
> >> Yeah, there's no merging of Ceph clusters and I don't think there ever
> >> will be. Setting up the CRUSH maps this way to start, and only having
> >> a single entry for most of the levels, would work just fine though.
> >
> > Thanks for confirming my suspicions. If we start with a well-designed
> > CRUSH map, we can probably migrate the data outside of Ceph and just
> > grow one system, and as the others empty, reformat them and bring them
> > in.
> >
> >> Yeah, there is very little real-world Ceph experience with cache pools,
> >> and there's a lot working with an SSD journal + hard drive backing
> >> store; I'd start with that.
> >
> > Other thoughts are using something like bcache or dm-cache on each OSD.
> > bcache is tempting because a single SSD device can serve multiple disks,
> > whereas dm-cache has to have a separate SSD device/partition for each
> > disk (plus metadata). I plan on testing this unless someone says that it
> > is absolutely not worth the time.
> >
> >> Yeah, no async replication at all for generic workloads. You can do the
> >> "2 in my rack and one in a different rack" thing just fine, although
> >> it's a little tricky to set up. (There are email threads about this
> >> that hopefully you can find; I've been part of one of them.) The
> >> min_size is all about preserving a minimum resiliency of *every* write
> >> (if a PG's replication is degraded but not yet repaired); if you had a
> >> 2+1 setup then min_size of 2 would just make sure there are at least
> >> two copies somewhere (but not that they're in different racks or
> >> whatever).
> >
> > The current discussion in the office is: if the cluster (2+1) is
> > HEALTHY, does the write return after 2 of the OSDs (itself and one
> > replica) complete the write, or only after all three have completed the
> > write? We are planning to try to do some testing on this as well if a
> > clear answer can't be found.
>
> It's only after all three have completed the write.
> Every write to Ceph is replicated synchronously to every OSD which is
> actively hosting the PG that the object resides in.
> -Greg
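
For anyone finding this thread later, here is a minimal sketch of the "2 in my
rack and one in a different rack" setup Greg describes, assuming a CRUSH
hierarchy with rack and host buckets under a root named "default". The rule
name, ruleset number, and the pool name "rbd" are only illustrative, and the
exact rule syntax can vary between Ceph releases:

    # Decompile the current CRUSH map, edit it, recompile, and inject it.
    ceph osd getcrushmap -o crushmap.bin
    crushtool -d crushmap.bin -o crushmap.txt

    # Rule added to crushmap.txt: choose two racks, then up to two hosts in
    # each rack; with a pool size of 3 that places two copies in the first
    # rack and one in the second.
    rule two_plus_one_rack {
            ruleset 1
            type replicated
            min_size 1      # replica-count range this rule applies to,
            max_size 10     # not the pool's min_size setting
            step take default
            step choose firstn 2 type rack
            step chooseleaf firstn 2 type host
            step emit
    }

    crushtool -c crushmap.txt -o crushmap.new
    ceph osd setcrushmap -i crushmap.new

    # Point a pool at the rule; min_size 2 blocks I/O to any PG that drops
    # below two active copies, matching the resiliency Greg describes.
    ceph osd pool set rbd crush_ruleset 1
    ceph osd pool set rbd size 3
    ceph osd pool set rbd min_size 2

The firstn 2 / firstn 2 steps over-select four candidate hosts and let the
pool's size of 3 trim the list, which is why the placement ends up as two
copies in one rack and one in another.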
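
And on the resource-limit point earlier in the thread (fd and pid limits at
scale), the knobs usually involved look something like the following; the
numbers are placeholders to illustrate the idea, not tuned recommendations:

    # /etc/security/limits.conf -- raise the open-file limit for the user
    # running the OSDs (shown here as root)
    root    soft    nofile    131072
    root    hard    nofile    131072

    # /etc/sysctl.conf -- allow more processes/threads per node, since each
    # OSD spawns many threads
    kernel.pid_max = 4194303

    # ceph.conf, [global] section -- have the daemons raise their own
    # open-file limit at startup
    max open files = 131072

Whether "max open files" takes effect depends on how the daemons are started,
so it is worth checking /proc/<pid>/limits on a running OSD after restarting.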