Hi,

> On 10 Nov 2016, at 12:17, han vincent <hangzws@xxxxxxxxx> wrote:
>
> Hello, all:
> Recently, I have a plan to build a large-scale ceph cluster in
> production for OpenStack. I want to build the cluster as large as
> possible.
> In the following mailing list thread, Karol asked a question about
> the "largest ceph cluster":
> http://lists.ceph.com/pipermail/ceph-users-ceph.com/2014-April/028371.html
> In that thread, Dreamhost and CERN both said they had built a 3-PB
> cluster.
>
> In the last few days, I read CERN's report "Ceph ~30PB Test Report":
> https://cds.cern.ch/record/2015206/files/CephScaleTestMarch2015.pdf

Most of those issues were fixed in the jewel release.

> In order to build such a large cluster, the CERN team made some
> changes:
> 1. Set the noin and noup flags before activating the OSDs, to keep
> the osdmap from changing too frequently.

Probably not needed any longer.

> 2. With the following configuration, the memory consumption of the
> OSD and monitor daemons decreased from ~2GB to ~500MB:
>
> [global]
> osd map message max=10
> [osd]
> osd map cache size=20
> osd map max advance=10
> osd map share max epochs=10
> osd pg epoch persisted max stale=10

Maybe needed; it depends how much RAM you have.

> 3. Add SSDs to the monitors, because the monitors were overloaded
> with too many OSD creation transactions.

SSDs on the mons are essential, IMHO.

> 4. Upgrade ceph to Hammer to keep leveldb from growing rapidly.

Use ceph jewel.

> It seems that CERN's 30-PB cluster was for testing only and is not
> yet in a production environment?

Correct, that was a test. We repeated the test more recently and had far
fewer problems. See slide 19:

https://indico.cern.ch/event/542464/contributions/2202295/attachments/1289543/1921810/cephday-dan.pdf

> I would like to know, as things stand today, how large a cluster is
> the best fit for a production environment. 3 PB? 30 PB? Or bigger?

The important limitation is the number of OSDs. There are several 2000-OSD
clusters. Our largest clusters in production have just over 1000 OSDs.
We've tested up to 7200 OSDs in the past, but in practice I would avoid
exceeding, say, 5000 OSDs with the jewel code.

In fact, for the OpenStack use case there is no reason you shouldn't run
several Ceph clusters. To be on the safe side, I would advise you to run
several 1000-1500 OSD clusters. Once you have the deployment and
operations well tuned, operating several clusters is not much more work
than operating one very large cluster, and it would avoid some potential
issues.

> And if a large-scale cluster has been built, how should such a large
> cluster be maintained afterwards?
> What are the core issues of a large cluster, and what can we do to
> avoid the potential problems?

Memory usage of the OSDs during recovery is one. The impracticality of
making large changes to the CRUSH map is another. Maybe there are other
pain points I'm forgetting.

Good luck!

Cheers, Dan
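
For reference, the noin/noup flags from point 1 above are ordinary
cluster-wide flags set and cleared with the standard ceph CLI; a minimal
sketch (the exact ordering around OSD activation is an assumption based on
the CERN report's description, not something spelled out in this thread):

    # noup: booting OSDs are not marked up; noin: OSDs are not
    # automatically marked in while the cluster is still being built
    ceph osd set noup
    ceph osd set noin

    # ... create and activate the OSDs ...

    # clear the flags once deployment is done so the OSDs can join
    ceph osd unset noup
    ceph osd unset noin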