Re: Ceph block storage cluster limitations

Hello,

Essentially, Anthony has given all the answers here.

Seeing as I'm sketching out a double-digit petabyte cluster ATM, I'm piping
up anyway.

On Sat, 30 Mar 2019 15:53:14 -0700 Anthony D'Atri wrote:

> > Hello,
> > 
> > I wanted to know if there are any max limitations on
> > 
> > - Max number of Ceph data nodes
> > - Max number of OSDs per data node
> > - Global max on number of OSDs
> > - Any limitations on the size of each drive managed by OSD?
> > - Any limitation on number of client nodes?
> > - Any limitation on maximum number of RBD volumes that can be created?  
>
For something you want to implement, it probably wouldn't hurt to shovel
some gold towards Red Hat and get replies in writing from them,
_IF_ they're willing to make such statements. ^o^

  
> I don’t think there are any *architectural* limits, but there can be *practical* limits.  There are a lot of variables and everyone has a unique situation, but some thoughts:
> 
> > Max number of Ceph data nodes  
> 
> May be limited at some extreme by networking.  Don’t cheap out on your switches.
> 
Not every OSD needs to talk to every other OSD, and the MON communication
is light, but yeah, it adds up.
Note that I said OSDs, not actual hosts; the host count itself doesn't
enter into any traffic equations.
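
To put a rough number on it, here is a back-of-envelope sketch (the PG
count and replica size are illustrative assumptions, not a recommendation):

    # Rough upper bound on how many distinct peer OSDs a single OSD
    # talks to; assumed values are illustrative only.
    pgs_per_osd = 100      # common target of ~100 PGs per OSD
    replica_size = 3       # 3x replication

    # Each PG contributes at most (size - 1) peers; overlap between PGs
    # means the real figure is usually noticeably lower.
    max_peers = pgs_per_osd * (replica_size - 1)
    print(f"at most ~{max_peers} peer OSDs, independent of cluster size")

So the per-OSD traffic fan-out stays bounded even in a very large cluster.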

> > - Max number of OSDs per data node  
> 
> People have run at least 72.  Consider RAM required for a given set of drives, and that a single host/chassis isn’t a big percentage of your cluster.  I.e., don’t have a huge fault domain that will bite you later.  For a production cluster at scale I would suggest at least 12 OSD nodes, but this depends on lots of variables.  Conventional wisdom is 1GB RAM per 1TB of OSD; in practice for a large cluster I would favor somewhat more.  A cluster with, say, 3 nodes of 72 OSDs each is going to be in a bad way when one fails.
> 

Yes, proper sizing of failure domains is a must.
Looking at the recent BlueStore discussions, 4GB of RAM per OSD feels like a
good starting point. That's also something affected by your use case and
requirements (with more RAM, more caching can be configured).
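
To make those rules of thumb concrete, a quick per-node estimate (the node
size and drive size below are just example assumptions):

    # Per-node RAM estimate under the two common rules of thumb.
    # All figures are illustrative, not recommendations.
    osds_per_node = 16
    tb_per_osd = 10          # e.g. 10TB HDDs

    gb_per_tb_rule = osds_per_node * tb_per_osd      # "1GB RAM per 1TB"
    gb_per_osd_rule = osds_per_node * 4               # "4GB RAM per OSD"

    print(f"1GB/TB rule:  {gb_per_tb_rule} GB RAM per node")
    print(f"4GB/OSD rule: {gb_per_osd_rule} GB RAM per node")

The 4GB/OSD figure maps onto BlueStore's osd_memory_target (4GiB by
default), which is the knob to turn if you want each OSD to cache more.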

> > - Global max on number of OSDs  
> 
> A cluster with at least 10,800 OSDs has existed.
> 
> https://indico.cern.ch/event/542464/contributions/2202295/attachments/1289543/1921810/cephday-dan.pdf
> https://indico.cern.ch/event/649159/contributions/2761965/attachments/1544385/2423339/hroussea-storage-at-CERN.pdf
> 
> The larger a cluster becomes, the more careful attention must be paid to topology and tuning.
> 
Amen to that. 

For the same 10PB cluster I've come up with designs ranging from nearly 800
nodes and 13,000 OSDs (3x replication, 16x 2.4TB HDDs per 2U node) to slightly
more than half of those numbers with the same HW but 10+4 erasure coding.

These numbers made me more than slightly anxious and had other people (who
wanted to use that existing HW) outright faint or run away screaming.

A different design with 60 effective OSDs (10TB HDDs) per 4U node requires
just 28 nodes and 1800 OSDs with 10+4 EC, which is much more manageable with
regard to numbers and rack space.
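
For anyone wanting to sanity-check the shape of these numbers, a
back-of-envelope calculation (my own simplifications: it ignores failure
headroom, nearfull ratios and TB-vs-TiB):

    # Rough OSD counts for ~10PB usable under different redundancy
    # schemes and drive sizes. Purely illustrative; no headroom for
    # failures or nearfull ratios.
    usable_tb = 10 * 1000

    def osds_needed(overhead, drive_tb):
        return usable_tb * overhead / drive_tb

    print("3x replica, 2.4TB HDDs:", round(osds_needed(3.0, 2.4)))     # ~12500
    print("10+4 EC,    2.4TB HDDs:", round(osds_needed(14 / 10, 2.4))) # ~5833
    print("10+4 EC,    10TB HDDs: ", round(osds_needed(14 / 10, 10)))  # ~1400

Which lands in the same ballpark as the node and OSD counts above once you
add room for failures and for not running the cluster anywhere near full.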

The same 4U hardware with 6 RAID6 arrays (10 HDDs each), and thus 6 OSDs per
node and 3x replication on the Ceph level, requires 64 nodes and results in
only 384 OSDs.
Whether this is a feasible design will no doubt be debated here by purists.
It does, however, have the distinct advantage of extreme resilience and is
very unlikely to ever require a Ceph rebuild due to a failed OSD (HDD).
And given 1800 disks, failures are statistically going to be a common
occurrence; see the rough estimate below.
It also needs significantly fewer CPU resources, since there is no EC and far
fewer OSDs.
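
The rough failure math behind that statement (the 2% AFR is just a
placeholder; plug in your own drive statistics):

    # Expected HDD failures per year for a fleet, assuming a constant
    # annualized failure rate (AFR). 2% is a made-up placeholder.
    drives = 1800
    afr = 0.02

    failures_per_year = drives * afr
    days_between = 365 / failures_per_year

    print(f"~{failures_per_year:.0f} failed drives per year")
    print(f"roughly one failure every {days_between:.0f} days")

With plain OSDs every one of those failures means a Ceph backfill; with the
RAID6 variant it's an in-chassis array rebuild instead.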

For the record and to forestall some comments, this is for an object
storage (RGW) cluster dealing with largish (3MB average) objects, so IOPS
aren't the prime objective here.

See also my "Erasure Coding failure domain" mail just now.

> > Also, any advise on using NVMes for OSD drives?  
> 
> They rock.  Evaluate your servers carefully:
> * Some may route PCI through a multi-mode SAS/SATA HBA
> * Watch for PCI bridges or multiplexing
> * Pinning, minimize data over QPI links
> * Faster vs more cores can squeeze out more performance 
> 
> AMD Epyc single-socket systems may be very interesting for NVMe OSD nodes.
> 
I'm happy that somebody else spotted this. ^o^

Regards,

Christian


> > What is the known maximum cluster size that Ceph RBD has been deployed to?  
> 
> See above.


-- 
Christian Balzer        Network/Systems Engineer                
chibi@xxxxxxx   	Rakuten Communications
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



