On Thu, Aug 14, 2014 at 12:40 PM, Robert LeBlanc <robert at leblancnet.us> wrote:
> We are looking to deploy Ceph in our environment and I have some
> musings that I would like some feedback on. There are concerns about
> scaling a single Ceph instance to the PBs of size we would use, so the
> idea is to start small, like one Ceph cluster per rack or two.

There are many groups running clusters >1PB, but whatever makes you
comfortable. There is a bit more of a learning curve once you reach a
certain scale than there is with smaller installations.

> Then, as we feel more comfortable with it, expand/combine clusters into
> larger systems. I'm not sure that it is possible to combine discrete
> Ceph clusters. It also seems to make sense to build a CRUSH map that
> defines regions, data centers, sections, rows, racks, and hosts now so
> that there is less data migration later, but I'm not sure how a merge
> would work.

Yeah, there's no merging of Ceph clusters and I don't think there ever
will be. Setting up the CRUSH maps this way to start, and only having a
single entry for most of the levels, would work just fine though.

> I've also been toying with the idea of an SSD journal per node versus
> an SSD cache tier pool versus lots of RAM for cache. Based on the
> performance webinar today, it seems that cache misses in the cache pool
> cause a lot of writing to the cache pool and severely degrade
> performance. I certainly like the idea of a heat map; that way a single
> read of an entire VM (backup, rsync) won't kill the cache pool.

Yeah, there is very little real-world Ceph experience with cache pools,
and there's a lot of experience with an SSD journal + hard drive backing
store; I'd start with that.

> I've also been bouncing around the idea of getting data locality by
> configuring the CRUSH map to keep two of the three replicas within the
> same row and the third replica just somewhere in the data center. Based
> on a conversation on IRC a couple of days ago, it seems that this could
> work very well if min_size is 2. But the documentation and the
> objective of Ceph seem to indicate that min_size only applies in
> degraded situations. During normal operation a write would have to be
> acknowledged by all three replicas before being returned to the client,
> otherwise it would be eventually consistent and not strongly consistent
> (I do like the idea of eventual consistency for replication as long as
> we can be strongly consistent in some form at the same time, like 2 out
> of 3).

Yeah, there's no async replication at all for generic workloads. You can
do the "two in my rack and one in a different rack" thing just fine,
although it's a little tricky to set up. (There are email threads about
this that hopefully you can find; I've been part of one of them, and
there's a rough sketch of the rule in the P.S. below.) The min_size is
all about preserving a minimum resiliency for *every* write (if a PG's
replication is degraded but not yet repaired); if you had a 2+1 setup
then a min_size of 2 would just make sure there are at least two copies
somewhere (but not that they're in different racks or whatever).
-Greg
Software Engineer #42 @ http://inktank.com | http://ceph.com
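
P.S. In case it's useful, here's a rough sketch of what I mean by
building out the full hierarchy now while only populating one bucket per
level. The bucket and host names are just placeholders, so adjust them
to match your layout:

    # one datacenter, one row, one rack for now; more can be added
    # later without changing the structure of the map
    ceph osd crush add-bucket dc1 datacenter
    ceph osd crush add-bucket row1 row
    ceph osd crush add-bucket rack1 rack
    ceph osd crush move dc1 root=default
    ceph osd crush move row1 datacenter=dc1
    ceph osd crush move rack1 row=row1
    # hosts (and their OSDs) then get placed under the rack as they
    # come up
    ceph osd crush move cephhost1 rack=rack1

Later expansion is then mostly just more add-bucket/move calls rather
than a restructuring of the map.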
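
For the SSD journal route, the simplest form is just pointing each OSD's
journal at a partition on the SSD, e.g. in ceph.conf (the device paths
below are made up; use whatever is stable on your systems):

    [osd.0]
        osd journal = /dev/disk/by-partlabel/journal-osd0
    [osd.1]
        osd journal = /dev/disk/by-partlabel/journal-osd1

Just keep in mind that piling too many OSD journals onto one SSD moves
the bottleneck to the SSD.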
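
And here's the flavor of rule I mean for the "two in my rack and one in
a different rack" placement, using multiple take/emit steps. Note that
rack1/rack2 are placeholders and this version hard-codes the racks,
which is part of what makes it a little tricky (you end up wanting a
rule per locality, or a cleverer hierarchy), so treat it as a sketch and
run it through crushtool --test before trusting it:

    rule two_local_one_remote {
            ruleset 1
            type replicated
            # min_size/max_size here are the pool sizes this rule
            # applies to, not the pool's min_size setting
            min_size 3
            max_size 3
            # two copies on different hosts in the local rack
            step take rack1
            step chooseleaf firstn 2 type host
            step emit
            # one copy on a host in the other rack
            step take rack2
            step chooseleaf firstn 1 type host
            step emit
    }

The pool then gets pointed at that rule, with size 3 and min_size 2:

    ceph osd pool set <pool> crush_ruleset 1
    ceph osd pool set <pool> size 3
    ceph osd pool set <pool> min_size 2

But as above, min_size only changes behavior when a PG is degraded;
during normal operation the write still has to be acknowledged by all
three replicas before the client gets its reply.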