Re: crushmap rules :: host selection

Adrian Sevcenco <Adrian.Sevcenco@xxxxxxx> · Mon, 29 Jan 2024 00:02:10 +0200

-------- Original Message --------
Subject:  crushmap rules :: host selection
From: Anthony D'Atri
To: Adrian Sevcenco
Date: 1/28/2024, 11:34:00 PM

so it depends on failure domain .. but with host failure domain, if there is space on some other OSDs
will the missing OSDs be "healed" on the available space on some other OSDs?
Yes, if you have enough hosts.  When using 3x replication it is thus advantageous to have at least 4 hosts.  Remember that the
yeah, this is the reason that i thought of 2x replication, as i was thinking of requesting 3 hosts

You can do RF=3 with 3 hosts.
yeah, i will see what i can get approved

placement granularity is the PG, not OSD.   Each placement group is placed independently.   Think of CRUSH as a hash function.
yeah, i keep remembering this and then i forget :)
is it wrong to think of PGs like a kind of object bucket (S3 like)?

Mostly, yes.
so .. in a PG there are no "file data" but pieces of "file data"?
so 100 GB file with 2x replication will be placed in more than 2 PGs?
Is there some kind of strips or chunks that a given data conglomerate is split into
and then _those_ are put in multiple PGs with the rule that they have to be in a Replicated=X ?

if so, is the size of PG the limit of a file size? (for replicated case)?

Modern Ceph clusters use the BlueStore back-end for OSDs.  There is no limit as such on PG size.  A PG will on average store 1/pg_num of the data in a pool
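
A rough sketch of why a large file or image ends up in many PGs, written in Python. It assumes the client (RBD, CephFS) splits the data into fixed-size RADOS objects, taken here as 4 MiB, and that each object name is hashed to a PG independently. The object naming scheme and the use of SHA-256 are stand-ins for illustration only; Ceph uses its own hash and naming, so the PG numbers below will not match a real cluster.

```python
import hashlib
from collections import Counter

OBJECT_SIZE = 4 * 2**20  # assumed object size: 4 MiB


def object_names(prefix: str, size_bytes: int, object_size: int = OBJECT_SIZE) -> list[str]:
    """Split a logical image/file into object names (hypothetical naming scheme)."""
    count = -(-size_bytes // object_size)  # ceiling division
    return [f"{prefix}.{index:016x}" for index in range(count)]


def pg_for_object(name: str, pg_num: int) -> int:
    """Map one object name to a PG id; SHA-256 stands in for Ceph's real hash."""
    digest = hashlib.sha256(name.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "little") % pg_num


# A 100 GiB image in a pool with 128 PGs (pg_num is an illustrative value).
names = object_names("image", 100 * 2**30)
pg_counts = Counter(pg_for_object(name, 128) for name in names)

print(len(names), "objects spread over", len(pg_counts), "PGs")
print("objects per PG: min", min(pg_counts.values()), "max", max(pg_counts.values()))
```

Each PG receives roughly 1/pg_num of the objects, and the replication factor then applies per PG, not per file.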

if i have 20 PGs of 10 GB size, uniformly occupied 50%, will i get a "no space" error
if i try to write a 6 GB file?

PGs aren’t sized, so no.

I'm thinking about a 3 node cluster with the replica=2
failure domain = host, in such a way if one node is down, the data
from there to be replicated on the remaining nodes
If one node is down the PGs will remain undersized because OSDs must be on disjoint hosts. Oh wait, you wrote size=2.
and will the healing (rebuilding/resilvering) process start immediately ? or after some time?
Pretty much immediately.  Ceph is fanatical about strong consistency.
isn't there any time delay? maybe configurable?
for example if some technician pulls the wrong network cable for a few seconds
i wouldn't like the healing to start on the microsecond that something went wrong

There are lots of nuances I haven’t mentioned so as to not confuse you with stuff you don’t need to know, at least not yet.  There is a configurable grace period before an OSD is marked down, and if you use `rack` failure domains you can prevent automatic rebalancing if an entire host goes down.  “Immediately” is relative.  I think the default grace is …. 5 minutes or so.
Oh, got it! well, at least is there, if i need it i will ask about :)
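
The "wrong cable pulled for a few seconds" case can be sketched as a simple two-threshold model in Python. This is a simplification under stated assumptions: an OSD is marked down once it has been unreachable for longer than a heartbeat grace period, and it is marked out (which is what triggers data movement to other OSDs) only after it has stayed down for a further interval. The settings assumed to correspond to these are `osd_heartbeat_grace` and `mon_osd_down_out_interval`; the numeric values used below are illustrative, not confirmed defaults, so check them on your own release.

```python
def outage_effect(outage_s: float, heartbeat_grace_s: float, down_out_interval_s: float) -> str:
    """Classify an outage of a given length under the simplified two-threshold model."""
    if outage_s < heartbeat_grace_s:
        return "not marked down"
    if outage_s < heartbeat_grace_s + down_out_interval_s:
        return "marked down, not out"
    return "marked out"


# Illustrative thresholds only (seconds); verify the real values on your cluster.
GRACE_S = 20
DOWN_OUT_S = 600

for outage in (5, 120, 3600):
    print(f"{outage:>5} s outage:", outage_effect(outage, GRACE_S, DOWN_OUT_S))
```

Under this model a cable pulled for a few seconds never gets as far as rebalancing; only an outage longer than both thresholds combined does.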

Ceph has like 2000 “options” that can be set.  Most of them you don’t need to know about and should never touch.
great! better to have many tunables and then just ask for usage than to have things hardcoded

, if one is down when the replacement is added it will take ~seq speed to rebuild the mirror
so for 22 TB
Ugh, HDDs for the lose.
well, i have two goals :
1. replacing the multiple VM images that i have replicated on different hosts
(multiple for the same VM!! .. it's a mess) and have a RBD pool for these

Ceph is great at block storage with RBD.  Most OpenStack installations use it for Glance and Cinder.

Note that if by image you mean guest OS images, then you may get away with HDDs.  If you mean images as in collections of data that get attached as a block storage volume to a VM that will use it for a database or as a filesystem, those huge spinners will likely not make you glad.
The 2nd case, and yes for RBD i plan to use nvme (i have the OS images for now but i know that i can convert and import 
them into rbd)

usually is ~40 hours
what is different, and what are the potential problems that can explode to data loss?
HDD dirty little secret # 437:  slow healing means an extended period of risky degraded redundancy.  Especially if you get stuck with SMR drives.
i do not use SMR, i have only the normal CMR, and this 30-40 hours is the 10+2 RAID6 rebuilding time

10+2?  Yeah.  Striping 2x 5+2 parity groups would give you double the write performance.  With conventional parity RAID the space amplification benefit to large parity groups rapidly sees diminishing returns, but the write performance hit is extreme.
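
The space side of that trade-off is simple arithmetic, sketched here in Python. The helper name `layout` is made up for this example; it only counts disks and says nothing about write performance.

```python
def layout(data_per_group: int, parity_per_group: int, groups: int = 1) -> tuple[int, float]:
    """Return (total disks, usable fraction) for striped parity groups."""
    data = groups * data_per_group
    total = groups * (data_per_group + parity_per_group)
    return total, data / total


for label, args in (("1x 10+2", (10, 2, 1)), ("2x 5+2", (5, 2, 2))):
    total, usable = layout(*args)
    print(f"{label}: {total} disks, {usable:.1%} usable")
```

Both layouts hold ten disks' worth of data: the single 10+2 group needs 12 disks (about 83% usable) and the two 5+2 groups need 14 (about 71% usable), so the smaller groups cost two extra drives.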

in these cases (NFS like usage) i seldom have random access but space is the most important, but this is to
be seen after i have the pilot working already

I would expect much more than 40 hours for a 22TB spinner - Ceph tries to limit recovery speed so that clients aren’t DoSed.   Erasure coding exacerbates this situation.  I’ve seen an 8TB OSD take 4 weeks to backfill when throttled enough to allow client traffic.  HDDs have slow, narrow interfaces with rotational / seek latency and thus are a false economy.
well, then this is worrisome for me .. as i said above, in a 12 disk RAID6 array, the rebuild time for 1 disk is under
40 hours (but it's true that i make the partition RO so no additional writes hammer the drives, only reads)

Ceph recovery times likely will be less, because you aren’t necessarily healing the full range of LBAs on a drive.
hmmm... right!! this will be a nice thing!
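
A back-of-the-envelope check of the numbers in this exchange, in Python. It assumes decimal terabytes and a sustained sequential rate of 150 MB/s, a figure picked only because it is consistent with the "~40 hours for 22 TB" quoted above; real recovery rates depend on throttling and client load.

```python
def hours_to_copy(terabytes: float, mb_per_s: float) -> float:
    """Hours needed to move `terabytes` (decimal TB) at a steady `mb_per_s`."""
    return terabytes * 1e12 / (mb_per_s * 1e6) / 3600


def effective_mb_per_s(terabytes: float, hours: float) -> float:
    """Average rate implied by moving `terabytes` in `hours`."""
    return terabytes * 1e12 / (hours * 3600) / 1e6


print(f"full 22 TB drive at 150 MB/s: {hours_to_copy(22, 150):.1f} h")
print(f"half-full drive (11 TB):      {hours_to_copy(11, 150):.1f} h")
print(f"8 TB in 4 weeks implies:      {effective_mb_per_s(8, 4 * 7 * 24):.1f} MB/s")
```

The second line is the point made above: only the data actually stored has to be healed, so a half-full OSD needs about half the time at the same rate. The third line shows how far throttled recovery can fall below sequential speed.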

Ceph has (adjustable) ratios for OSD fullness.  You generally want to maintain enough unused space to allow healing when drives fail.  If one is used to embedded or even software RAID this idea can take some time to get.  It makes Ceph WAY more flexible, e.g. you don’t have to maintain exact-size spare drives.  Though there are advantages to not having huge variation in size.
oh, that means that you can do something like "use at most 90% of space for normal operations, but if
there is a healing balancing use the rest of 10% until new OSD is in place and balancing will free that 10%" ?

Something like that.

ceph/src/common/options/global.yaml.in :

# writes will fail if an OSD exceeds this fullness
- name: mon_osd_full_ratio
  type: float
  level: advanced
  desc: full ratio of OSDs to be set during initial creation of the cluster
  default: 0.95
  flags:
  - no_mon_update
  - cluster_create
  with_legacy: true

# an OSD will refuse taking backfill if it exceeds this fullness
- name: mon_osd_backfillfull_ratio
  type: float
  level: advanced
  default: 0.9
  flags:
  - no_mon_update
  - cluster_create
  with_legacy: true

# The cluster’s health state will go WARN if any OSD exceeds this fullness
- name: mon_osd_nearfull_ratio
  type: float
  level: advanced
  desc: nearfull ratio for OSDs to be set during initial creation of cluster
  default: 0.85
  flags:
  - no_mon_update
  - cluster_create

Recent Ceph releases enforce that the values of these options are in ascending order: nearfull below backfillfull, backfillfull below full.  Having the nearfull ratio be 99% when the full ratio is 95% for example would make no sense.
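
How the three thresholds relate can be sketched in Python. The default values are the ones from the YAML excerpt above; the function names and the state labels are made up for this example, and the real checks inside Ceph are more involved.

```python
def ratios_in_order(nearfull: float, backfillfull: float, full: float) -> bool:
    """True when the thresholds ascend: nearfull < backfillfull < full <= 1."""
    return 0 < nearfull < backfillfull < full <= 1


def osd_state(utilization: float, nearfull: float = 0.85,
              backfillfull: float = 0.90, full: float = 0.95) -> str:
    """Classify one OSD by the highest threshold its utilization exceeds."""
    if not ratios_in_order(nearfull, backfillfull, full):
        raise ValueError("ratios must ascend: nearfull < backfillfull < full")
    if utilization > full:
        return "full: writes fail"
    if utilization > backfillfull:
        return "backfillfull: refuses backfill"
    if utilization > nearfull:
        return "nearfull: health WARN"
    return "ok"


for used in (0.50, 0.87, 0.92, 0.96):
    print(f"{used:.0%} used ->", osd_state(used))
```

With the defaults, the gap between 85% and 90% is the headroom in which an OSD is already flagged but can still accept backfill from a failed peer.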

Thanks a lot for info!!! This is very useful to me and overall all this information is enough to help me make
a case for the funding request to management for a pilot installation :)

Adrian
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx