Hello,

On Thu, 18 Dec 2014 16:12:09 -0800 Craig Lewis wrote:

Firstly I'd like to confirm what Craig said about small clusters.
I just changed my four storage node test cluster from 1 OSD per node to 4
and it can now saturate a 1GbE link (110MB/s), where before it peaked at
50-60MB/s. Of course now it is CPU bound and a bit tight on memory (those
nodes have 4GB RAM and 2 have just 1 CPU/core). ^o^

> I think this is it:
> https://engage.redhat.com/inktank-ceph-reference-architecture-s-201409080939
>
Ah, the joys of corporate address packratting.

> You can also check out a presentation on Cern's Ceph cluster:
> http://www.slideshare.net/Inktank_Ceph/scaling-ceph-at-cern
>
> At large scale, the biggest problem will likely be network I/O on the
> inter-switch links.
>
While true, I think it will reach an equilibrium of sorts: if you actually
have enough client traffic to saturate those links, it's time for an
upgrade anyway.

Aside from the purely technical questions and challenges of scaling Ceph
to those sizes (tuning all sorts of parameters, etc.), I think clusters of
that scale become an administrative nightmare first and foremost.

Let's take a look at a "classic" Ceph cluster with 10,000 OSDs:
It will have somewhere between 500 and 1000 nodes. That number alone
should give you pause; there are bound to be dead nodes on a regular
basis. And with 10,000 disks you're pretty much guaranteed to have a dead
OSD or more at any given time (see the various threads about how resilient
Ceph is), so you'll need a team of people swapping disks on a
constant/regular basis. And unless you also have a very good inventory and
tracking system, you will want to replace these OSDs in "order", so that
OSD 10 isn't on node 50 all of a sudden, etc.
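To put a very rough number on that, here is a quick back-of-the-envelope
sketch in Python. The failure rate and replacement turnaround below are
just my guesses, plug in your own:

# Very rough model, ignoring correlated failures, infant mortality, etc.
osds = 10000              # disks/OSDs in the cluster
afr = 0.04                # assumed annualized failure rate per disk (a guess)
turnaround_days = 3.0     # assumed time from failure to replaced and backfilled

failures_per_year = osds * afr
failures_per_day = failures_per_year / 365.0
dead_at_any_time = failures_per_day * turnaround_days

print("failed disks per year: %.0f" % failures_per_year)                  # ~400
print("failed disks per day:  %.1f" % failures_per_day)                   # ~1.1
print("dead/rebuilding OSDs at any time: %.1f" % dead_at_any_time)        # ~3.3

Even with those fairly forgiving guesses, that's somebody pulling a disk
more or less every day and a few OSDs down or backfilling at all times.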
There's probably a point of diminishing returns where adding ever more
OSDs stops making sense for various reasons. In fact, once you reach a few
hundred OSDs, consider RAIDed OSDs to ease maintenance (no more failed
OSDs, yeah! ^o^).

For me, the life cycle of a steadily growing cluster would be something
like this:

1. Start with as many nodes/OSDs as you can afford for performance, even
   if you don't need the space yet.
2. Keep adding OSDs to satisfy space and performance requirements as
   needed.
3. While performance is still good (or can't improve because of network
   limitations) but space requirements keep increasing, grow the size of
   your OSDs, not their number.

Regards,

Christian

>
> On Thu, Dec 18, 2014 at 3:29 PM, Robert LeBlanc <robert@xxxxxxxxxxxxx>
> wrote:
> >
> > I'm interested to know if there is a reference to this reference
> > architecture. It would help alleviate some of the fears we have about
> > scaling this thing to a massive scale (10,000's of OSDs).
> >
> > Thanks,
> > Robert LeBlanc
> >
> > On Thu, Dec 18, 2014 at 3:43 PM, Craig Lewis
> > <clewis@xxxxxxxxxxxxxxxxxx> wrote:
> >
> >> On Thu, Dec 18, 2014 at 5:16 AM, Patrick McGarry <patrick@xxxxxxxxxxx>
> >> wrote:
> >>>
> >>> > 2. What should be the minimum hardware requirement of the server
> >>> > (CPU, Memory, NIC etc)
> >>>
> >>> There is no real "minimum" to run Ceph, it's all about what your
> >>> workload will look like and what kind of performance you need. We
> >>> have seen Ceph run on Raspberry Pis.
> >>
> >> Technically, the smallest cluster is a single node with a 10 GiB disk.
> >> Anything smaller won't work.
> >>
> >> That said, Ceph was envisioned to run on large clusters. IIRC, the
> >> reference architecture has 7 rows, each row having 10 racks, all full.
> >>
> >> Those of us running small clusters (less than 10 nodes) are noticing
> >> that it doesn't work quite as well. We have to significantly scale
> >> back the amount of backfilling and recovery that is allowed. I try
> >> to keep all backfill/recovery operations touching less than 20% of my
> >> OSDs. In the reference architecture, it could lose a whole row and
> >> still stay under that limit. My 5 node cluster is noticeably better
> >> than the 3 node cluster: it's faster, has lower latency, and latency
> >> doesn't increase as much during recovery operations.

--
Christian Balzer        Network/Systems Engineer
chibi@xxxxxxx           Global OnLine Japan/Fusion Communications
http://www.gol.com/