Thanks, your responses have been helpful.

On Tue, Aug 19, 2014 at 1:48 PM, Gregory Farnum <greg at inktank.com> wrote:
> On Tue, Aug 19, 2014 at 11:18 AM, Robert LeBlanc <robert at leblancnet.us>
> wrote:
> > Greg, thanks for the reply, please see in-line.
> >
> > On Tue, Aug 19, 2014 at 11:34 AM, Gregory Farnum <greg at inktank.com>
> > wrote:
> >>
> >> There are many groups running clusters >1PB, but whatever makes you
> >> comfortable. There is a bit more of a learning curve once you reach a
> >> certain scale than there is with smaller installations.
> >
> > What do you find to be the most difficult issues at large scale? It may
> > help ease some of the concerns if we know what we can expect.
>
> Well, I'm a developer, not a maintainer, so I'm probably the wrong person
> to ask about what surprises people. But in general it's stuff like:
> 1) Tunable settings matter more.
> 2) Behavior that was unfortunate but left the cluster alive in a small
> cluster (e.g., you have a bunch of slow OSDs that keep flapping) could
> turn into a data non-availability event in a large one (because with that
> many more OSDs misbehaving it overwhelms the monitors or something).
> 3) Resource consumption limits start popping up (e.g., fd and pid limits
> need to be increased).
>
> Things like that. These are generally a matter of admin education at this
> scale (the code issues are fairly well sorted out by now, although there
> were plenty of those to be found on the first multi-petabyte-scale
> cluster).
>
> >> Yeah, there's no merging of Ceph clusters and I don't think there ever
> >> will be. Setting up the CRUSH maps this way to start, and only having
> >> a single entry for most of the levels, would work just fine though.
> >
> > Thanks for confirming my suspicions. If we start with a well-designed
> > CRUSH map, we can probably migrate the data outside of Ceph and just
> > grow one system, and as the others empty, reformat them and bring them
> > in.
> >
> >> Yeah, there is very little real-world Ceph experience with cache pools,
> >> and there's a lot working with an SSD journal + hard drive backing
> >> store; I'd start with that.
> >
> > Other thoughts are using something like bcache or dm-cache on each OSD.
> > bcache is tempting because a single SSD device can serve multiple disks,
> > whereas dm-cache has to have a separate SSD device/partition for each
> > disk (plus metadata). I plan on testing this unless someone says that it
> > is absolutely not worth the time.
> >
> >> Yeah, no async replication at all for generic workloads. You can do the
> >> "2 in my rack and one in a different rack" thing just fine, although
> >> it's a little tricky to set up. (There are email threads about this
> >> that hopefully you can find; I've been part of one of them.) The
> >> min_size is all about preserving a minimum resiliency of *every* write
> >> (if a PG's replication is degraded but not yet repaired); if you had a
> >> 2+1 setup then min_size of 2 would just make sure there are at least
> >> two copies somewhere (but not that they're in different racks or
> >> whatever).
> >
> > The current discussion in the office is: if the cluster (2+1) is
> > HEALTHY, does the write return after 2 of the OSDs (itself and one
> > replica) complete the write, or only after all three have completed the
> > write? We are planning to try to do some testing on this as well if a
> > clear answer can't be found.
>
> It's only after all three have completed the write.
> Every write to Ceph is replicated synchronously to every OSD which is
> actively hosting the PG that the object resides in.
> -Greg
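
For anyone finding this thread later, here is a minimal sketch of the "2 in my
rack and one in a different rack" setup Greg describes, assuming a CRUSH
hierarchy with rack and host buckets under a root named "default". The rule
name, ruleset number, and the pool name "rbd" are only illustrative, and the
exact rule syntax can vary between Ceph releases:

    # Decompile the current CRUSH map, edit it, recompile, and inject it.
    ceph osd getcrushmap -o crushmap.bin
    crushtool -d crushmap.bin -o crushmap.txt

    # Rule added to crushmap.txt: choose two racks, then up to two hosts in
    # each rack; with a pool size of 3 that places two copies in the first
    # rack and one in the second.
    rule two_plus_one_rack {
            ruleset 1
            type replicated
            min_size 1      # replica-count range this rule applies to,
            max_size 10     # not the pool's min_size setting
            step take default
            step choose firstn 2 type rack
            step chooseleaf firstn 2 type host
            step emit
    }

    crushtool -c crushmap.txt -o crushmap.new
    ceph osd setcrushmap -i crushmap.new

    # Point a pool at the rule; min_size 2 blocks I/O to any PG that drops
    # below two active copies, matching the resiliency Greg describes.
    ceph osd pool set rbd crush_ruleset 1
    ceph osd pool set rbd size 3
    ceph osd pool set rbd min_size 2

The firstn 2 / firstn 2 steps over-select four candidate hosts and let the
pool's size of 3 trim the list, which is why the placement ends up as two
copies in one rack and one in another.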
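
And on the resource-limit point earlier in the thread (fd and pid limits at
scale), the knobs usually involved look something like the following; the
numbers are placeholders to illustrate the idea, not tuned recommendations:

    # /etc/security/limits.conf -- raise the open-file limit for the user
    # running the OSDs (shown here as root)
    root    soft    nofile    131072
    root    hard    nofile    131072

    # /etc/sysctl.conf -- allow more processes/threads per node, since each
    # OSD spawns many threads
    kernel.pid_max = 4194303

    # ceph.conf, [global] section -- have the daemons raise their own
    # open-file limit at startup
    max open files = 131072

Whether "max open files" takes effect depends on how the daemons are started,
so it is worth checking /proc/<pid>/limits on a running OSD after restarting.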