Dear Abby: Why Is Architecting CEPH So Hard?

Hey Folks,

This is my first post to the CEPH user group, and I will preface it by saying I know much of this gets asked frequently. Unlike what I assume to be the large majority of CEPH “users” on this forum, I am more of a CEPH “distributor”: my interest is in how to build a CEPH environment that best fits an organization’s needs. I am here for real-world experience and expertise so that I can learn to build CEPH “right.” I have spent the last couple of years collecting general “best practices” from forum posts, the CEPH documentation, Cephalocon talks, etc., and I wanted to post my findings here to see where I can harden my stance.

Below are two example designs I might use when architecting a solution today. Each has specific design elements I would like you to tell me whether they hold water. I want to focus on the hardware, so I am asking for generalizations where possible. Assume in all scenarios that we are running Luminous and that the workload is mixed use.
I am not expecting anyone to run through every question, so please feel free to comment on any piece you can. Tell me what is overkill and what is lacking!

Example 1:
8x 60-Bay (8TB) Storage nodes (480x 8TB SAS Drives)
Storage Node Spec: 
2x 32C 2.9GHz AMD EPYC
   - The documentation mentions 0.5 cores per OSD for throughput-optimized clusters. Are they talking about 0.5 physical cores or 0.5 logical cores?
   - Is it better to pick my processors based on a total GHz measurement, like 2GHz per OSD?
   - Would a theoretical 8C at 2GHz serve the same number of OSDs as a 16C at 1GHz? Would threads be included in this calculation?
512GB Memory
   - I know this is the hot topic because of its role in recoveries. Basically, I am looking for the most general practice I can use as a safe number, plus a metric I can treat as a nice-to-have.
   - Is it 1GB of RAM per TB of raw OSD capacity?
2x 3.2TB NVMe WAL/DB / Log Drives
   - Another hot topic that I am sure will bring many “it depends.” All I am looking for is real experience on this. I know people have mentioned having at least 70GB of flash for WAL/DB / logs.
   - Can I use 70GB as a flat calculation per OSD, or does it depend on the size of the OSD?
   - I know more is better, but what is a number I can use to get started with minimal issues?
2x 56Gbit Links
   - I think this should be enough given the rule of thumb of 10Gbit for every 12 OSDs. (My back-of-the-envelope math for this whole node spec is sketched right after this list.)
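
To make the questions above concrete, here is the rough Python I use to sanity-check this node spec. The constants (0.5 cores per OSD, 1GB of RAM per TB, 70GB of WAL/DB flash per OSD, 10Gbit per 12 OSDs) are just the rules of thumb I quoted above, so treat all of it as my assumptions rather than anything authoritative:

# Back-of-the-envelope check for one 60-bay HDD node, using the rules of
# thumb quoted above. All of these constants are my own assumptions.
osds_per_node   = 60            # one OSD per 8TB SAS drive
osd_size_tb     = 8
physical_cores  = 2 * 32        # 2x 32C EPYC, threads not counted
ram_gb          = 512
wal_db_flash_gb = 2 * 3200      # 2x 3.2TB NVMe for WAL/DB
network_gbit    = 2 * 56

cores_wanted  = osds_per_node * 0.5                # 30 cores at 0.5 per OSD
ram_wanted_gb = osds_per_node * osd_size_tb * 1.0  # 480GB at 1GB per TB
flash_per_osd = wal_db_flash_gb / osds_per_node    # ~106GB of flash per OSD
gbit_wanted   = osds_per_node / 12.0 * 10          # 50Gbit at 10Gbit per 12 OSDs

print("cores:   have %d, rule of thumb wants %.0f" % (physical_cores, cores_wanted))
print("RAM:     have %dGB, rule of thumb wants %.0fGB" % (ram_gb, ram_wanted_gb))
print("WAL/DB:  %.0fGB of flash per OSD (vs the 70GB figure)" % flash_per_osd)
print("network: have %dGbit, rule of thumb wants %.0fGbit" % (network_gbit, gbit_wanted))

By that math the CPUs and RAM look comfortable and the 2x 56Gbit links clear the 50Gbit figure on paper, but please tell me which of those constants are wrong.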
3x MON Node
MON Node Spec:
1x 8C 3.2GHz AMD EPYC
   - I can’t really find good practices around when to increase your core count. Any suggestions?
128GB Memory
   - What do I need memory for in a MON?
   - When do I need to expand?
2x 480GB Boot SSDs
   - Any reason to look more closely into the sizing of these drives?
2x 25Gbit Uplinks
   - Should these match the output of the storage nodes for any reason?


Example 2:
8x 12-Bay NVMe Storage nodes (96x 1.6TB NVMe Drives)
Storage Node Spec: 
2x 32C 2.9GHz AMD EPYC
   - I have read that each NVMe OSD should have 10 cores. I am not splitting physical drives into multiple OSDs, so let’s assume I have 12 OSDs per node.
   - Would threads count toward my 10 core quota or just physical cores?
   - Can I do a similar calculation as I mentioned before and just use 20GHz per OSD instead of focusing on cores specifically?
512GB Memory
   - I assume there is some reason I can’t use the same methodology of 1GB per TB of OSD, since this is NVMe storage.
2x 100Gbit Links
   - This assumes about 1 gigabyte per second of real-world throughput per disk. (Rough math for this node is sketched right after this spec.)
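
Again, here is the rough math I am using for this NVMe node, with the 10-cores-per-NVMe-OSD figure and the ~1GB/s-per-drive figure baked in. These numbers are collected hearsay, not measurements:

# Same sanity check for one 12-bay NVMe node. The 10 cores per NVMe OSD
# and ~1GB/s of real-world throughput per drive are assumptions I have
# collected from reading, not measured numbers.
osds_per_node  = 12             # one OSD per 1.6TB NVMe drive
osd_size_tb    = 1.6
physical_cores = 2 * 32
ram_gb         = 512
network_gbit   = 2 * 100

cores_wanted  = osds_per_node * 10                  # 120 cores, more than I have
ram_wanted_gb = osds_per_node * osd_size_tb * 1.0   # ~19GB if 1GB/TB still applied
drive_gbit    = osds_per_node * 1.0 * 8             # ~96Gbit/s of aggregate drive throughput

print("cores:      have %d, 10-per-OSD rule wants %d" % (physical_cores, cores_wanted))
print("RAM:        have %dGB, 1GB/TB rule would only ask for %.0fGB" % (ram_gb, ram_wanted_gb))
print("throughput: drives could push ~%.0fGbit/s vs %dGbit of links" % (drive_gbit, network_gbit))

That core math is what prompts my GHz-per-OSD question above, and the memory math is why I assume 1GB per TB cannot be the right rule for NVMe.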

3x MON Node – What differences should MONs serving NVMe have compared to large NLSAS pools?
MON Node Spec:
1x 8C 3.2GHz AMD EPYC
128GB Memory
2x 480GB Boot SSDs
2x 25Gbit Uplinks



