On Tue, Aug 19, 2014 at 11:18 AM, Robert LeBlanc <robert at leblancnet.us> wrote:
> Greg, thanks for the reply, please see in-line.
>
>
> On Tue, Aug 19, 2014 at 11:34 AM, Gregory Farnum <greg at inktank.com> wrote:
>>
>>
>> There are many groups running clusters >1PB, but whatever makes you
>> comfortable. There is a bit more of a learning curve once you reach a
>> certain scale than there is with smaller installations.
>
>
> What do you find to be the most difficult issues at large scale? It may
> help ease some of the concerns if we know what we can expect.

Well, I'm a developer, not a maintainer, so I'm probably the wrong
person to ask about what surprises people. But in general it's stuff
like:
1) Tunable settings matter more
2) Behavior that was unfortunate but left the cluster alive in a small
cluster (e.g., you have a bunch of slow OSDs that keep flapping) can
turn into a data non-availability event in a large one (because with
that many more OSDs misbehaving, the monitors get overwhelmed, or
something similar)
3) Resource consumption limits start popping up (e.g., fd and pid
limits need to be increased)
Things like that. These are generally a matter of admin education at
this scale (the code issues are fairly well sorted out by now, although
there were plenty of those to be found on the first
multi-petabyte-scale cluster).

>
>> Yeah, there's no merging of Ceph clusters and I don't think there ever
>> will be. Setting up the CRUSH maps this way to start, and only having
>> a single entry for most of the levels, would work just fine though.
>
>
> Thanks for confirming my suspicions. If we start with a well-designed
> CRUSH map, we can probably migrate the data outside of Ceph and just
> grow one system; as the others empty, we can reformat them and bring
> them in.
>
>> Yeah, there is very little real-world Ceph experience with cache
>> pools, and there's a lot with an SSD journal + hard drive backing
>> store; I'd start with that.
>
>
> Other thoughts are using something like bcache or dm-cache on each OSD.
> bcache is tempting because a single SSD device can serve multiple disks,
> whereas dm-cache has to have a separate SSD device/partition for each
> disk (plus metadata). I plan on testing this unless someone says that it
> is absolutely not worth the time.
>
>>
>> Yeah, no async replication at all for generic workloads. You can do
>> the "2 in my rack and one in a different rack" thing just fine, although
>> it's a little tricky to set up. (There are email threads about this
>> that hopefully you can find; I've been part of one of them.) The
>> min_size is all about preserving a minimum resiliency for *every* write
>> (if a PG's replication is degraded but not yet repaired); if you had a
>> 2+1 setup then a min_size of 2 would just make sure there are at least
>> two copies somewhere (but not that they're in different racks or
>> whatever).
>
>
> The current discussion in the office is: if the cluster (2+1) is HEALTHY,
> does the write return after two of the OSDs (itself and one replica)
> complete the write, or only after all three have completed it? We are
> planning to do some testing on this as well if a clear answer can't be
> found.

It's only after all three have completed the write. Every write to Ceph
is replicated synchronously to every OSD that is actively hosting the
PG the object resides in.
-Greg
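
For the resource limits Greg mentions in (3), a minimal sketch of the
kind of settings involved; the exact values here are illustrative
assumptions, not from this thread, and should be sized to the number of
OSDs per host:

    # /etc/security/limits.conf -- per-process open-file limit for the
    # user the OSD daemons run as (root for ceph daemons of this era)
    root  soft  nofile  131072
    root  hard  nofile  131072

    # /etc/sysctl.conf -- system-wide file handle and pid/thread ceilings
    fs.file-max = 524288
    kernel.pid_max = 4194303

    # ceph.conf [global] -- lets the daemon raise its own fd limit at start
    max open files = 131072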
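
A minimal sketch of the "2 in my rack and one in a different rack" rule
Greg refers to, in decompiled CRUSH map syntax. The rule name and the
ruleset number are hypothetical, and it assumes a hierarchy with rack
and host buckets under a root called "default":

    rule twoplusone {
            ruleset 1
            type replicated
            min_size 2
            max_size 3
            step take default
            # pick two racks, then up to two hosts in each; with pool
            # size 3 the first three results give two copies in one rack
            # and one copy in another
            step choose firstn 2 type rack
            step chooseleaf firstn 2 type host
            step emit
    }

The edited map would then be compiled and injected, and the pool pointed
at the new rule, roughly via "crushtool -c map.txt -o map.bin",
"ceph osd setcrushmap -i map.bin", and
"ceph osd pool set <pool> crush_ruleset 1".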
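
On the size/min_size point, a small illustrative example (the pool name
is just a placeholder): size sets how many replicas each object gets,
and per Greg's answer all of them are written synchronously before the
client gets its ack; min_size only sets how many copies must be up for
a degraded PG to keep accepting I/O.

    ceph osd pool set mypool size 3       # three copies of every object
    ceph osd pool set mypool min_size 2   # serve I/O only with >= 2 copies up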