Re: Dear Abby: Why Is Architecting CEPH So Hard?

Hello Cody,

There are a few simple rules for designing a good, stable, and performant
Ceph cluster.

1) Don't choose big systems. Not only are they often more expensive, but a
single system going down also has a much bigger impact.

2) Throw away everything that is not required, like RAID controllers; make
the system as simple as possible.

3) Plan CPU with a rule of thumb:
 - for HDD, 1 CPU thread of any CPU is OK
 - for midrange SSD/NVMe, 1 CPU core is most likely OK
 - for high-end NVMe, up to 4 CPU cores (8 threads) can be consumed; most
setups would be OK with 2 cores per disk
And generally, the faster the cores, the better. This is especially
important on high-end NVMe.
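
For illustration, a quick Python sketch of that CPU rule (the per-OSD
thread counts are just the rule of thumb above, not benchmarks):

    # CPU threads needed per node = OSDs per node * threads per OSD.
    # Assumed per-OSD thread counts, taken from the rule of thumb above:
    THREADS_PER_OSD = {
        "hdd": 1,        # 1 CPU thread per HDD OSD
        "ssd_mid": 2,    # ~1 core (2 threads) per midrange SSD/NVMe OSD
        "nvme_high": 8,  # up to 4 cores (8 threads) per high-end NVMe OSD
    }

    def cpu_threads_needed(osds_per_node, media):
        return osds_per_node * THREADS_PER_OSD[media]

    print(cpu_threads_needed(12, "nvme_high"))  # 12 NVMe OSDs -> 96 threads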

4) Plan memory as the number of OSD drives * 6-8 GB, then choose the next
optimal DIMM config (for example 128 GB).
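
A minimal example of that memory math (the DIMM-friendly totals are my
assumed steps, adjust to your platform):

    # Memory per node: OSD count * 6-8 GB, rounded up to the next
    # common memory configuration.
    def memory_gb(num_osds, gb_per_osd=8):
        need = num_osds * gb_per_osd
        for total in (64, 128, 256, 512, 1024):
            if total >= need:
                return total
        return need

    print(memory_gb(14))  # 14 OSDs * 8 GB = 112 GB -> choose 128 GB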

5) Network:
 - HDDs don't provide much performance; 2*10G is totally fine
 - midrange SSD/NVMe can exceed 10G, so that is the bare minimum, but 100G
is way too much ;)
 - high-end NVMe can exceed even a dual 40G link, but honestly, I have only
ever seen Ceph recovery traffic in that performance range, never client
traffic
And overall, choose a modern all-paths-active network design, like
leaf-spine with VXLAN, to scale.
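
To sanity-check the link speeds, a rough estimate (the per-OSD throughput
figures are my assumptions, not measurements):

    # Gbit/s needed per node = OSDs * real-world MB/s per OSD * 8 / 1000.
    # Assumptions: ~40 MB/s per HDD OSD under mixed Ceph load (raw
    # sequential specs are far higher but rarely sustained), and
    # ~1000 MB/s per high-end NVMe OSD.
    def gbit_needed(osds, mb_per_osd):
        return osds * mb_per_osd * 8 / 1000.0

    print(gbit_needed(12, 40))    # 12 HDD OSDs  -> ~3.8 Gbit, 2*10G is plenty
    print(gbit_needed(12, 1000))  # 12 fast NVMe -> ~96 Gbit, beyond dual 40G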

6) DB/WAL:
 - definitely will decrease latency
 - can increase performance
 - does require flash with high write endurance if you don't want to get
into trouble
 - sizing is a hot topic ;). I currently just plan 300G (not 299) per OSD
for best performance. Choose a PCIe interface; a SATA interface for DB/WAL
will be a bottleneck.
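
The DB/WAL arithmetic is trivial, but for completeness (the 300G figure is
the rule above; the point about 299 is my understanding that the DB should
fit a whole RocksDB level, otherwise the extra space goes unused):

    # Total DB/WAL device capacity = OSDs sharing the device * 300 GB.
    def db_device_gb(osds_on_device, gb_per_osd=300):
        return osds_on_device * gb_per_osd

    print(db_device_gb(6))  # 6 OSDs on one PCIe NVMe -> 1800 GB device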

You can colocate any service of a unified Ceph cluster on all the hosts. If
you add services like MON, RGW, or MDS, you need to add some extra
resources to your calculation:
MON) Just throw it in; the rule of thumb above will work without a problem.
RGW) Metadata requires an SSD/NVMe pool, as HDD is too slow. Depending on
the required performance, some more CPU is needed. As we plan more but
smaller servers, the load can be distributed across more nodes, and it
scales much better.
MDS) Can easily consume large amounts of memory; how much depends on the
use case. Most likely the rule of thumb above is enough, but if there are
many open files, choose the next bigger DIMM config.

In the end, inexperienced customers especially have a great need for good
Ceph management as well. If you are interested, please feel free to contact
me and I will show you how we do it. We also have reseller options; maybe
that's something for you.

--
Martin Verges
Managing director

Mobile: +49 174 9335695
E-Mail: martin.verges@xxxxxxxx
Chat: https://t.me/MartinVerges

croit GmbH, Freseniusstr. 31h, 81247 Munich
CEO: Martin Verges - VAT-ID: DE310638492
Com. register: Amtsgericht Munich HRB 231263

Web: https://croit.io
YouTube: https://goo.gl/PGE1Bx


On Wed, 22 Apr 2020 at 23:47, <cody.schmidt@xxxxxxxxxxxxxxxxxxx> wrote:

> Hey Folks,
>
> This is my first ever post here in the CEPH user group and I will preface
> with the fact that I know this is a lot of what many people ask frequently.
> Unlike what I assume to be a large majority of CEPH “users” in this forum,
> I am more of a CEPH “distributor.” My interests lie in how to build a CEPH
> environment to best fill an organization’s needs. I am here for the
> real-world experience and expertise so that I can learn to build CEPH
> “right.” I have spent the last couple years collecting data on general
> “best practices” through forum posts, CEPH documentation, Cephalocon, etc. I
> wanted to post my findings to the forum to see where I can harden my stance.
>
> Below are two example designs that I might use when architecting a
> solution currently. I have specific questions around design elements in
> each that I would like you to approve for holding water or not. I want to
> focus on the hardware, so I am asking for generalizations where possible.
> Let’s assume in all scenarios that we are using Luminous and that the data
> type is mixed use.
> I am not expecting anyone to run through every question, so please feel
> free to comment on any piece you can. Tell me what is overkill and what is
> lacking!
>
> Example 1:
> 8x 60-Bay (8TB) Storage nodes (480x 8TB SAS Drives)
> Storage Node Spec:
> 2x 32C 2.9GHz AMD EPYC
>    - Documentation mentions .5 cores per OSD for throughput optimized. Are
> they talking about .5 Physical cores or .5 Logical cores?
>    - Is it better to pick my processors based on a total GHz measurement
> like 2GHz per OSD?
>    - Would a theoretical 8C at 2GHz serve the same number of OSDs as a 16C
> at 1GHz? Would Threads be included in this calculation?
> 512GB Memory
>    - I know this is the hot topic because of its role in recoveries.
> Basically, I am looking for the most generalized practice I can use as a
> safe number and a metric I can use as a nice to have.
>    - Is it 1GB of RAM per TB of RAW OSD?
> 2x 3.2TB NVMe WAL/DB / Log Drive
>    - Another hot topic that I am sure will bring many “it depends.” All I
> am looking for is experience on this. I know people have mentioned having
> at least 70GB of Flash for WAL/DB / Logs.
>    - Can I use 70GB as a flat calculation per OSD or does it depend on the
> size of the OSD?
>    - I know more is better, but what is a number I can use to get started
> with minimal issues?
> 2x 56Gbit Links
> - I think this should be enough given the rule of thumb of 10Gbit for
> every 12 OSDs.
> 3x MON Node
> MON Node Spec:
> 1x 8C 3.2GHz AMD EPYC
> - I can’t really find good practices around when to increase your core
> count. Any suggestions?
> 128GB Memory
>    - What do I need memory for in a MON?
>    - When do I need to expand?
> 2x 480GB Boot SSDs
>    - Any reason to look more closely into the sizing of these drives?
> 2x 25Gbit Uplinks
>    - Should these match the output of the storage nodes for any reason?
>
>
> Example 2:
> 8x 12-Bay NVMe Storage nodes (96x 1.6TB NVMe Drives)
> Storage Node Spec:
> 2x 32C 2.9GHz AMD EPYC
>    - I have read that each NVMe OSD should have 10 cores. I am not
> splitting Physical drives into multiple OSDs so let’s assume I have 12 OSD
> per Node.
>    - Would threads count toward my 10 core quota or just physical cores?
>    - Can I do a similar calculation as I mentioned before and just use
> 20GHz per OSD instead of focusing on cores specifically?
> 512GB Memory
>    - I assume there is some reason I can’t use the same methodology of
> 1GB per TB of OSD since this is NVMe storage.
> 2x 100Gbit Links
>    - This is assuming about 1Gigabyte per second of real-world speed per
> disk
>
> 3x MON Node – What differences should MONs serving NVMe have compared to
> large NL-SAS pools?
> MON Node Spec:
> 1x 8C 3.2GHz AMD Epyc
> 128GB Memory
> 2x 480GB Boot SSDs
> 2x 25Gbit Uplinks
> _______________________________________________
> ceph-users mailing list -- ceph-users@xxxxxxx
> To unsubscribe send an email to ceph-users-leave@xxxxxxx
>



