Dear Abby: Why Is Architecting CEPH So Hard?

Hello,

It looks like the original chain got deleted, but thank you to everyone who responded. To keep any newcomers in the loop, I have pasted the original posting below. Thanks to the original contributors to this chain, I feel much more confident in my design theory for the storage nodes. However, I wanted to narrow the focus and see if I can get more detailed comments on the two topics below.

Does anyone have any real-world data or metrics I can use to size MONs?
When are they active?
When do they consume CPU, RAM, and storage (e.g. do larger storage pools require more resources, are resources used during recovery, etc.)?

For anyone who commented or has opinions on storage node sizing:
How does choosing EC vs. 3x replication affect your sizing of CPU / RAM?
Is there some kind of overhead generalization I can use if assuming EC (e.g. add an extra core per OSD)? I understand that recoveries are where this is most important, so I am looking for sizing metrics based on living through worst-case scenarios. The capacity sketch below shows the kind of generalization I am after.
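For context, here is the raw-capacity arithmetic I have been working from so far, as a minimal Python sketch. The EC profiles (4+2 and 8+3) are just examples I picked, and the sketch only captures space overhead; the CPU/RAM cost of EC during normal I/O and recovery is exactly the part I do not have numbers for.

# Rough comparison of space overhead for 3x replication vs. erasure coding.
# This only covers raw-capacity math; it says nothing about the extra CPU
# that EC encode/decode burns during client I/O and recovery.

def usable_fraction_replication(copies: int = 3) -> float:
    """Usable fraction of raw capacity with N-way replication."""
    return 1.0 / copies

def usable_fraction_ec(k: int, m: int) -> float:
    """Usable fraction of raw capacity with a k+m erasure-coded pool."""
    return k / (k + m)

raw_tb = 8 * 60 * 8  # Example 1 below: 8 nodes x 60 bays x 8TB = 3840 TB raw
for label, frac in [
    ("3x replication", usable_fraction_replication(3)),
    ("EC 4+2", usable_fraction_ec(4, 2)),
    ("EC 8+3", usable_fraction_ec(8, 3)),
]:
    print(f"{label:16s} usable ~ {raw_tb * frac:7.0f} TB ({frac:.0%} of {raw_tb} TB raw)")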



---------------------------
ORIGINAL POSTING:

Hey Folks,

This is my first ever post here in the CEPH user group, and I will preface it by saying that
I know much of this is asked frequently. Unlike what I assume to be a large majority of CEPH
“users” in this forum, I am more of a CEPH “distributor.” My interests lie in how to build a
CEPH environment to best fill an organization’s needs. I am here for the real-world
experience and expertise so that I can learn to build CEPH “right.” I have spent the last
couple of years collecting data on general “best practices” through forum posts, the CEPH
documentation, Cephalocon talks, etc. I wanted to post my findings to the forum to see where
I can harden my stance.

Below are two example designs that I might currently use when architecting a solution. Each
one has specific design elements that I would like you to judge for holding water or not. I
want to focus on the hardware, so I am asking for generalizations where possible. Let’s
assume in all scenarios that we are using Luminous and that the workload is mixed use.
I am not expecting anyone to run through every question, so please feel free to comment on
any piece you can. Tell me what is overkill and what is lacking!

Example 1:
8x 60-Bay (8TB) Storage nodes (480x 8TB SAS Drives)
Storage Node Spec:
2x 32C 2.9GHz AMD EPYC
- The documentation mentions 0.5 cores per OSD for throughput-optimized clusters. Are they
talking about 0.5 physical cores or 0.5 logical cores?
- Is it better to pick my processors based on a total GHz measurement, like 2GHz per OSD?
- Would a theoretical 8C at 2GHz serve the same number of OSDs as a 16C at 1GHz? Would
threads be included in this calculation?
512GB Memory
- I know this is a hot topic because of its role in recoveries. Basically, I am looking
for the most general practice I can use as a safe number, and a metric I can use as a
nice-to-have.
- Is it 1GB of RAM per TB of raw OSD capacity?
2x 3.2TB NVMe WAL/DB / Log Drives
- Another hot topic that I am sure will bring many “it depends” answers. All I am looking
for is experience on this. I know people have mentioned having at least 70GB of flash for
WAL/DB / logs.
- Can I use 70GB as a flat number per OSD, or does it depend on the size of the OSD?
- I know more is better, but what is a number I can use to get started with minimal
issues?
2x 56Gbit Links
- I think this should be enough given the rule of thumb of 10Gbit for every 12 OSDs. A
quick sanity check of the whole node against these rules of thumb follows below.
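To show how I am applying the rules of thumb above, here is a quick Python sketch for one
60-bay node. All of the ratios are just the guideline numbers quoted above (0.5 cores per
OSD, 1GB RAM per TB, 70GB of flash per OSD, 10Gbit per 12 OSDs), not figures I have
validated.

# Back-of-the-envelope check of one 60-bay node against the rules of thumb
# quoted above. All ratios are guideline numbers from this post, not
# validated measurements.

osds_per_node    = 60    # one OSD per 8TB SAS drive
osd_size_tb      = 8
cores_per_osd    = 0.5   # throughput-optimized guideline (physical vs. logical unclear)
ram_gb_per_tb    = 1.0   # 1GB RAM per TB of raw OSD capacity
waldb_gb_per_osd = 70    # minimum flash for WAL/DB per OSD
gbit_per_12_osds = 10    # 10Gbit of network per 12 OSDs

print("cores needed :", osds_per_node * cores_per_osd)                # 30 of 64 available
print("RAM needed GB:", osds_per_node * osd_size_tb * ram_gb_per_tb)  # 480 of 512 available
print("WAL/DB GB    :", osds_per_node * waldb_gb_per_osd)             # 4200 of ~6400 (2x 3.2TB NVMe)
print("network Gbit :", osds_per_node / 12 * gbit_per_12_osds)        # 50 of 112 (2x 56Gbit)

By that math the node has headroom on cores, WAL/DB space, and network, and is right at the
edge on RAM, which is why I am asking what the safe number really is.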
3x MON Node
MON Node Spec:
1x 8C 3.2GHz AMD EPYC
- I can’t really find good practices around when to increase your core count. Any
suggestions?
128GB Memory
- What do I need memory for in a MON?
- When do I need to expand?
2x 480GB Boot SSDs
- Any reason to look more closely into the sizing of these drives?
2x 25Gbit Uplinks
- Should these match the output of the storage nodes for any reason?


Example 2:
8x 12-Bay NVMe Storage nodes (96x 1.6TB NVMe Drives)
Storage Node Spec:
2x 32C 2.9GHz AMD EPYC
- I have read that each NVMe OSD should have 10 cores. I am not splitting physical
drives into multiple OSDs, so let’s assume I have 12 OSDs per node.
- Would threads count toward my 10-core quota, or just physical cores?
- Can I do a similar calculation as before and just use 20GHz per OSD instead of focusing
on cores specifically?
512GB Memory
- I assume there is some reason I can’t use the same methodology of 1GB per TB of OSD,
since this is NVMe storage?
2x 100Gbit Links
- This assumes about 1 gigabyte per second of real-world throughput per disk. A quick
sanity check of this node follows below.
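As a sanity check on those assumptions, here is the same style of arithmetic for one NVMe
node, again a rough Python sketch using only the guideline numbers quoted above (10 cores
per NVMe OSD, about 1GB/s per disk).

# Quick check of one 12-bay NVMe node against the guidelines quoted above.
# Again, the ratios are rules of thumb from this post, not measurements.

osds_per_node           = 12   # one OSD per 1.6TB NVMe drive
cores_per_nvme_osd      = 10   # guideline: ~10 cores per NVMe OSD (physical vs. SMT unclear)
gbytes_per_sec_per_disk = 1    # assumed real-world throughput per NVMe drive

cores_needed    = osds_per_node * cores_per_nvme_osd           # 120 vs. 64 physical / 128 threads
net_gbit_needed = osds_per_node * gbytes_per_sec_per_disk * 8  # GB/s -> Gbit/s

print("cores needed  :", cores_needed)
print("network needed:", net_gbit_needed, "Gbit/s vs. 2x 100Gbit links")

By that math the 10-core guideline asks for 120 cores against 64 physical cores / 128
threads, which is exactly why I am asking whether threads count toward the quota.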

3x MON Node – What differences should MONs serving NVMe pools have compared to those
serving large NLSAS pools?
MON Node Spec:
1x 8C 3.2GHz AMD EPYC
128GB Memory
2x 480GB Boot SSDs
2x 25Gbit Uplinks
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx



