Hello,

essentially Anthony has given all the answers here. Seeing as I'm
sketching out a double-digit petabyte cluster ATM, I'm piping up anyway.

On Sat, 30 Mar 2019 15:53:14 -0700 Anthony D'Atri wrote:

> > Hello,
> >
> > I wanted to know if there are any max limitations on
> >
> > - Max number of Ceph data nodes
> > - Max number of OSDs per data node
> > - Global max on number of OSDs
> > - Any limitations on the size of each drive managed by OSD?
> > - Any limitation on number of client nodes?
> > - Any limitation on maximum number of RBD volumes that can be created?
>
For something you want to implement, it probably wouldn't hurt to shovel
some gold towards Red Hat and get replies in writing from them.
_IF_ they're willing to make such statements. ^o^

> I don't think there are any *architectural* limits, but there can be
> *practical* limits. There are a lot of variables and everyone has a
> unique situation, but some thoughts:
>
> > Max number of Ceph data nodes
>
> May be limited at some extreme by networking. Don't cheap out on your
> switches.
>
Not every OSD needs to talk to every other OSD, and the MON communication
is light, but yes, it adds up.
Note that I said OSDs, not actual hosts; hosts as such don't figure into
the traffic equations.

> > - Max number of OSDs per data node
>
> People have run at least 72. Consider the RAM required for a given set
> of drives, and that a single host/chassis shouldn't be a big percentage
> of your cluster, i.e. don't have a huge fault domain that will bite you
> later. For a production cluster at scale I would suggest at least 12 OSD
> nodes, but this depends on lots of variables. Conventional wisdom is 1GB
> of RAM per 1TB of OSD; in practice, for a large cluster, I would favor
> somewhat more. A cluster with, say, 3 nodes of 72 OSDs each is going to
> be in a bad way when one node fails.
>
Yes, proper sizing of failure domains is a must.
Looking at the recent BlueStore discussions, it feels like 4GB of RAM per
OSD is a good starting point. That's of course affected by your use case
and requirements (with more RAM, more caching can be configured).

> > - Global max on number of OSDs
>
> A cluster with at least 10,800 OSDs has existed.
>
> https://indico.cern.ch/event/542464/contributions/2202295/attachments/1289543/1921810/cephday-dan.pdf
> https://indico.cern.ch/event/649159/contributions/2761965/attachments/1544385/2423339/hroussea-storage-at-CERN.pdf
>
> The larger a cluster becomes, the more careful attention must be paid
> to topology and tuning.
>
Amen to that.

For the same 10PB cluster I've come up with designs ranging from nearly
800 nodes and 13,000 OSDs (3x replication, 16x 2.4TB HDDs per 2U node)
down to slightly more than half those numbers with the same HW but 10+4
erasure coding. These numbers made me more than slightly anxious and had
other people (who wanted to use that existing HW) outright faint or run
away screaming.

A different design with 60 effective OSDs (10TB HDDs) per 4U node requires
just 28 nodes and about 1,800 OSDs with 10+4 EC, which is much more
manageable with regard to numbers and rack space.

The same 4U hardware with 6 RAID6 sets (10 HDDs each), and thus 6 OSDs per
node and 3x replication on the Ceph level, requires 64 nodes and results
in only 384 OSDs. Whether this is a feasible design will no doubt be
debated here by purists. It does, however, have the distinct advantage of
extreme resilience and is very unlikely to ever require a Ceph-level
rebuild due to a failed HDD, since a single disk failure is absorbed by
the RAID. And with around 1,800 disks, failures are statistically going
to be a common occurrence. It also allows for significantly less CPU,
since there is no EC and there are far fewer OSDs.
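To make the arithmetic behind those node/OSD counts easy to play with,
here is a rough back-of-the-envelope sketch in Python. Everything in it
(function name, the fill_margin knob) is purely illustrative; the figures
quoted above evidently include extra headroom and rounding to whole
chassis, so expect ballpark agreement only.

#!/usr/bin/env python3
# Rough node/OSD count estimate for a target usable capacity.
import math


def nodes_and_osds(usable_tb, osd_tb, osds_per_node, overhead, fill_margin=1.0):
    """Return (nodes, osds) needed for usable_tb of user data.

    overhead    -- raw:usable factor, e.g. 3.0 for 3x replication,
                   14/10 for a 10+4 erasure coding profile
    fill_margin -- fraction of raw space you actually plan to use; a real
                   design would set this well below 1.0 (nearfull ratios,
                   rebalancing headroom)
    """
    raw_tb = usable_tb * overhead / fill_margin
    nodes = math.ceil(raw_tb / (osd_tb * osds_per_node))
    return nodes, nodes * osds_per_node


if __name__ == "__main__":
    usable = 10_000  # ~10PB, expressed in TB

    # 2U chassis, 16x 2.4TB HDDs, one OSD per HDD
    print("2U, 3x replica :", nodes_and_osds(usable, 2.4, 16, 3.0))      # ~800 nodes / ~13,000 OSDs above
    print("2U, 10+4 EC    :", nodes_and_osds(usable, 2.4, 16, 14 / 10))  # roughly half of that

    # 4U chassis, 60x 10TB HDDs, one OSD per HDD
    print("4U, 10+4 EC    :", nodes_and_osds(usable, 10, 60, 14 / 10))   # 28 nodes / ~1,800 OSDs above

    # 4U chassis, 6 RAID6 sets of 10x 10TB HDDs (8 data disks each),
    # i.e. one ~80TB OSD per RAID set, 6 OSDs per node
    print("4U, RAID6 + 3x :", nodes_and_osds(usable, 80, 6, 3.0))        # 64 nodes / 384 OSDs above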
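And to put the two RAM rules of thumb from further up side by side for the
dense 4U chassis, just plugging in the numbers from this mail (nothing
authoritative, the per-OSD target is tunable anyway):

#!/usr/bin/env python3
# Compare the "1GB RAM per 1TB of OSD" rule with the "4GB RAM per OSD"
# BlueStore starting point, for a 60x 10TB HDD node. Illustrative only.

osds_per_node = 60
tb_per_osd = 10

ram_per_tb_rule = osds_per_node * tb_per_osd * 1   # GB: 1GB per TB of OSD
ram_per_osd_rule = osds_per_node * 4               # GB: ~4GB per OSD

print(f"1GB/TB rule : {ram_per_tb_rule} GB of RAM per node")   # 600 GB
print(f"4GB/OSD rule: {ram_per_osd_rule} GB of RAM per node")  # 240 GB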
For the record, and to forestall some comments: this is for an object
storage (RGW) cluster dealing with largish (3MB average) objects, so IOPS
aren't the prime objective here. See also my "Erasure Coding failure
domain" mail just now.

> > Also, any advice on using NVMes for OSD drives?
>
> They rock. Evaluate your servers carefully:
> * Some may route PCI through a multi-mode SAS/SATA HBA
> * Watch for PCI bridges or multiplexing
> * Pinning; minimize data over QPI links
> * Faster vs. more cores can squeeze out more performance
>
> AMD Epyc single-socket systems may be very interesting for NVMe OSD
> nodes.
>
I'm happy that somebody else spotted this. ^o^

Regards,

Christian

> > What is the known maximum cluster size that Ceph RBD has been
> > deployed to?
>
> See above.

--
Christian Balzer        Network/Systems Engineer
chibi@xxxxxxx           Rakuten Communications
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com