Re: Need help from Ceph experts

Hello,

On Thu, 18 Dec 2014 16:12:09 -0800 Craig Lewis wrote:

Firstly, I'd like to confirm what Craig said about small clusters.
I just changed my four-storage-node test cluster from 1 OSD per node to 4
and it can now saturate a 1GbE link (110MB/s), where before it peaked at
50-60MB/s. Of course, it is now CPU-bound and a bit tight on memory (those
nodes have 4GB of RAM and two of them have just 1 CPU/core). ^o^
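
If you want to reproduce that kind of number yourself, something along
these lines should do; this is just a sketch, assuming a scratch pool
named "bench" and a client sitting on the same 1GbE link:

  # 60 seconds of 4MB writes, keeping the objects around for the read test
  rados bench -p bench 60 write --no-cleanup
  # sequential reads of the objects written above
  rados bench -p bench 60 seq
  # remove the benchmark objects afterwards
  rados -p bench cleanup

The "Bandwidth (MB/sec)" summary it prints is the sort of number I'm
talking about.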

> I think this is it:
> https://engage.redhat.com/inktank-ceph-reference-architecture-s-201409080939
>
Ah, the joys of corporate address packratting. 
 
> You can also check out a presentation on Cern's Ceph cluster:
> http://www.slideshare.net/Inktank_Ceph/scaling-ceph-at-cern
> 
> 
> At large scale, the biggest problem will likely be network I/O on the
> inter-switch links.
> 
While true, I think it will reach an equilibrium of sorts: if you actually
have enough client traffic to saturate those links, it's time for an
upgrade.

Aside from the purely technical questions and challenges of scaling Ceph to
those sizes (tuning all sorts of parameters, etc.), I think clusters of that
scale become an administrative nightmare first and foremost.

Let's take a look at a "classic" Ceph cluster with 10,000 OSDs:
It will have somewhere between 500 and 1,000 nodes. That number alone
should give you pause: with that many machines, there are bound to be dead
nodes on a regular basis.
And with 10,000 disks, you're pretty much guaranteed to have one or more
dead OSDs at any given time (see the various threads about how resilient
Ceph is).
So you'll need a team of people swapping disks on a constant/regular basis.
And unless you also have a very good inventory and tracking system, you
will want to replace these OSDs "in order", so that OSD 10 doesn't end up
on node 50 all of a sudden, etc.
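
To put some (assumed) numbers on that: at a 4% annual failure rate, 10,000
disks means roughly 400 dead disks a year, i.e. more than one per day, so
even with a 24-hour replacement turnaround there will essentially always be
at least one OSD down.
As for keeping IDs in place, the usual remove-and-recreate dance does that,
because "ceph osd create" hands out the lowest free ID. A rough sketch for
a failed osd.10 (the ID is made up, and it only gets reused if nothing else
was removed in the meantime):

  ceph osd out osd.10
  # stop the OSD daemon on its node, then drop it from CRUSH, auth and the map
  ceph osd crush remove osd.10
  ceph auth del osd.10
  ceph osd rm 10
  # after swapping the disk, this returns the lowest free ID, i.e. 10 again
  ceph osd create
  # then prepare/activate the new disk as usual (ceph-disk, ceph-deploy, ...)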

There's probably a point of diminishing returns where adding more OSDs
stops making sense for various reasons.
In fact, once you reach a few hundred OSDs, consider RAIDed OSDs to ease
maintenance (no more failed OSDs, yeah! ^o^).

For me, the life cycle of a steadily growing cluster would be something
like this:
1. Start with as many nodes/OSDs as you can afford for performance,
even if you don't need the space yet.
2. Keep adding OSDs to satisfy space and performance requirements as
needed.
3. While performance is still good (or can't improve further because of
network limitations) but space requirements keep growing, grow the size of
your OSDs, not their number (a reweight sketch follows below).
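
For 3., swapping a disk for a bigger one is mostly a matter of telling
CRUSH about it. A minimal sketch, assuming the common convention of CRUSH
weight = size in TiB and a hypothetical osd.10 going from 2TB to 4TB:

  # after the new disk is in and the OSD is back up, bump its CRUSH weight
  ceph osd crush reweight osd.10 3.64
  # and watch the resulting data movement
  ceph -w

The reweight will of course trigger backfilling, so on a small cluster do
it for a few OSDs at a time.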

Regards,

Christian
> 
> 
> On Thu, Dec 18, 2014 at 3:29 PM, Robert LeBlanc <robert@xxxxxxxxxxxxx>
> wrote:
> >
> > I'm interested to know if there is a reference to this reference
> > architecture. It would help alleviate some of the fears we have about
> > scaling this thing to a massive scale (10,000's OSDs).
> >
> > Thanks,
> > Robert LeBlanc
> >
> > On Thu, Dec 18, 2014 at 3:43 PM, Craig Lewis
> > <clewis@xxxxxxxxxxxxxxxxxx> wrote:
> >
> >>
> >>
> >> On Thu, Dec 18, 2014 at 5:16 AM, Patrick McGarry <patrick@xxxxxxxxxxx>
> >> wrote:
> >>>
> >>>
> >>> > 2. What should be the minimum hardware requirement of the server
> >>> > (CPU, Memory, NIC etc)
> >>>
> >>> There is no real "minimum" to run Ceph, it's all about what your
> >>> workload will look like and what kind of performance you need. We
> >>> have seen Ceph run on Raspberry Pis.
> >>
> >>
> >> Technically, the smallest cluster is a single node with a 10 GiB disk.
> >> Anything smaller won't work.
> >>
> >> That said, Ceph was envisioned to run on large clusters.  IIRC, the
> >> reference architecture has 7 rows, each row having 10 racks, all full.
> >>
> >> Those of us running small clusters (less than 10 nodes) are noticing
> >> that it doesn't work quite as well.  We have to significantly scale
> >> back the amount of backfilling and recovery that is allowed.  I try
> >> to keep all backfill/recovery operations touching less than 20% of my
> >> OSDs.  In the reference architecture, it could lose a whole row, and
> >> still keep under that limit.  My 5-node cluster is noticeably better
> >> than the 3-node cluster.  It's faster, has lower latency, and
> >> latency doesn't increase as much during recovery operations.
> >>


-- 
Christian Balzer        Network/Systems Engineer                
chibi@xxxxxxx   	Global OnLine Japan/Fusion Communications
http://www.gol.com/
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



