Re: crushmap rules :: host selection

"Anthony D'Atri" <anthony.datri@xxxxxxxxx> · Sun, 28 Jan 2024 16:34:00 -0500

> 
>>> so it depends on failure domain .. but with host failure domain, if there is space on some other OSDs
>>> will the missing OSDs be "healed" on the available space on some other OSDs?
>> Yes, if you have enough hosts.  When using 3x replication it is thus advantageous to have at least 4 hosts.  Remember that the
> yeah, this is the reason that i thought to 2x replication, as i was thinking to request 3 hosts

You can do RF=3 with 3 hosts.
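A rough sketch of the host-count arithmetic, assuming failure domain = host (one copy of each PG per host) and enough free space on the surviving hosts. The function names are made up for illustration and are not Ceph APIs.

```python
def replicas_after_host_failure(total_hosts, size, failed_hosts=1):
    """Copies of each PG that can exist once recovery settles.

    Assumes failure domain = host, so each surviving host can hold
    at most one copy of any given PG.
    """
    surviving = max(total_hosts - failed_hosts, 0)
    return min(size, surviving)


def can_fully_heal(total_hosts, size, failed_hosts=1):
    """True if the cluster can get back to `size` copies without the lost host."""
    return replicas_after_host_failure(total_hosts, size, failed_hosts) == size


if __name__ == "__main__":
    for hosts, size in ((3, 3), (4, 3), (3, 2)):
        print("hosts=%d size=%d heals fully after one host loss: %s"
              % (hosts, size, can_fully_heal(hosts, size)))
```

So RF=3 on 3 hosts works, but with one host down the PGs stay undersized at 2 copies until that host returns; a fourth host is what gives them somewhere to heal.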

>> placement granularity is the PG, not OSD.   Each placement group is placed independently.   Think of CRUSH as a hash function.   
> yeah, i keep remembering this and then i forget :)
> is it wrong to think of PGs like a kind of object bucket (S3 like)?

Mostly, yes.

> if so, is the size of PG the limit of a file size? (for replicated case)?

Modern Ceph clusters use the BlueStore back-end for OSDs.  There is no limit as such on PG size.  A PG will on average store 1/pg_num of the data in a pool.
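To make the "1/pg_num on average" point concrete, here is a toy sketch that hashes object names into PGs.  It uses SHA-256 as a stand-in hash, which is NOT what Ceph actually uses, and the object names and function names are invented for the example.

```python
import hashlib
from collections import Counter


def pg_for_object(name, pg_num):
    """Map an object name to a PG index with a stand-in hash (not Ceph's)."""
    digest = hashlib.sha256(name.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % pg_num


def pg_shares(object_count, pg_num):
    """Fraction of the objects that lands in each PG."""
    counts = Counter(pg_for_object("obj-%d" % i, pg_num)
                     for i in range(object_count))
    return [counts[pg] / object_count for pg in range(pg_num)]


if __name__ == "__main__":
    shares = pg_shares(20000, 20)
    print("expected share per PG: %.4f" % (1 / 20))
    print("smallest / largest observed: %.4f / %.4f"
          % (min(shares), max(shares)))
```

A PG here is just a hash bucket with no capacity of its own.  A large RBD image or file is split into many smaller RADOS objects, so it spreads across many PGs rather than having to fit inside one.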

> if i have 20 PGs of 10 Gb size, uniformly occupied 50%, will i get an "no space" error
> if i try to write an 6 GB file?

PGs aren’t sized, so no.
> 
> 
>>>>> I'm thinking about a 3 node cluster with the replica=2
>>>>> failure domain = host, in such a way if one node is down, the data
>>>>> from there to be replicated on the remaining nodes
>>>> If one node is down the PGs will remain undersized because OSDs must be on disjoint hosts. Oh wait, you wrote size=2.
>>> and will the healing (rebuilding/resilvering) process start immediately ? or after some time?
>> Pretty much immediately.  Ceph is fanatical about strong consistency.
> isn't there any time delay? maybe configurable?
> for example if some technician pulls the wrong network cable for a few seconds
> i wouldn't like the healing to start on the microsecond that something went wrong

There are lots of nuances I haven’t mentioned so as to not confuse you with stuff you don’t need to know, at least not yet.  There is a configurable grace period before an OSD is marked down, and if you use `rack` failure domains you can prevent automatic rebalancing if an entire host goes down.  “Immediately” is relative.  I think the default grace is …. 5 minutes or so.
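A simplified model of the timing, as a sketch only.  It assumes two timers: a grace period before an unreachable OSD is marked down, then a further interval before a down OSD is marked out, which is when data starts moving.  I believe the corresponding options are osd_heartbeat_grace and mon_osd_down_out_interval, but check the documentation for your release; the numbers below are placeholders, not defaults.

```python
def osd_state(seconds_unreachable, grace_s, down_out_interval_s):
    """Classify an unreachable OSD under a simplified two-timer model.

    grace_s and down_out_interval_s are caller-supplied placeholders;
    this is not how the monitors are actually implemented.
    """
    if seconds_unreachable < grace_s:
        return "up"    # still within the grace period, nothing happens
    if seconds_unreachable < grace_s + down_out_interval_s:
        return "down"  # PGs are degraded, but no backfill to other OSDs yet
    return "out"       # CRUSH remaps the PGs and backfill begins


if __name__ == "__main__":
    # Placeholder timers: 30 s grace, 300 s before down becomes out.
    for t in (5, 60, 1000):
        print("%5d s unreachable -> %s" % (t, osd_state(t, 30, 300)))
```

Under this model a cable pulled for a few seconds never gets past "up", and one pulled for a minute causes degraded PGs but no rebalancing.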

Ceph has like 2000 “options” that can be set.  Most of them you don’t need to know about and should never touch.

> 
>>> , if one is down when the replacement is added it will take ~seq speed to rebuild the mirror
>>> so for 22 TB
>> Ugh, HDDs for the lose.
> well, i have two goals :
> 1. replacing the multiple VM images that i have replicated on different hosts
> (multiple for the same VM!! .. it's a mess) and have a RBD pool for these

Ceph is great at block storage with RBD.  Most OpenStack installations use it for Glance and Cinder.

Note that if by image you mean guest OS images, then you may get away with HDDs.  If you mean images as in collections of data that get attached as a block storage volume to a VM that will use it for a database or as a filesystem, those huge spinners will likely not make you glad.

> 
> 
>>> usually is ~40 hours
>>> what is different, and what are the potential problems that can explode to data loss?
>> HDD dirty little secret # 437:  slow healing means an extended period of risky degraded redundancy.  Especially if you get stuck with SMR drives.
> i do not use SMR, i have only the normal CMR, and this 30-40 hours is the 10+2 RAID6 rebuilding time

10+2?  Yeah.  Striping 2x 5+2 parity groups would give you double the write performance.  With conventional parity RAID the space amplification benefit to large parity groups rapidly sees diminishing returns, but the write performance hit is extreme.
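The diminishing returns on space are easy to see with a little arithmetic.  A minimal sketch, assuming two parity drives per group as in RAID6; the function name is made up for the example.

```python
def usable_fraction(data_drives, parity_drives=2):
    """Fraction of raw capacity left for data in one parity group."""
    return data_drives / (data_drives + parity_drives)


if __name__ == "__main__":
    for k in (3, 5, 10, 20):
        print("%2d+2: %.1f%% usable" % (k, 100 * usable_fraction(k)))
```

Going from 5+2 to 10+2 buys roughly 12 points of usable capacity; doubling again to 20+2 buys less than 8, while every small write still has to touch a wider stripe.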

> 
>> I would expect much more than 40 hours for a 22TB spinner - Ceph tries to limit recovery speed so that clients aren’t DoSed.   Erasure coding exacerbates this situation.  I’ve seen an 8TB OSD take 4 weeks to backfill when throttled enough to allow client traffic.  HDDs have slow, narrow interfaces with rotational / seek latency and thus are a false economy.
> well, then this is worrisome for me .. as i said above, in a 12 disks RAID6 array, the rebuild time for 1 disk is under
> 40 hours (but it's true that i make the partition RO so no additional writes hammers the drives, only reads)

Ceph recovery times likely will be less, because you aren’t necessarily healing the full range of LBAs on a drive.

> 
>> Ceph has (adjustable) ratios for OSD fullness.  You generally want to maintain enough unused space to allow healing when drives fail.  If one is used to embedded or even software RAID this idea can take some time to get.  It makes Ceph WAY more flexible, e.g. you don’t have to maintain exact-size spare drives.  Though there are advantages to not having huge variation in size.
> oh, that means that you can do something like "use at most 90% of space for normal operations, but if
> there is a healing balancing use the rest of 10% until new OSD is in place and balancing will free that 10%" ?

Something like that.

ceph/src/common/options/global.yaml.in :

# writes will fail if an OSD exceeds this fullness
- name: mon_osd_full_ratio
  type: float
  level: advanced
  desc: full ratio of OSDs to be set during initial creation of the cluster
  default: 0.95
  flags:
  - no_mon_update
  - cluster_create
  with_legacy: true

# an OSD will refuse taking backfill if it exceeds this fullness
- name: mon_osd_backfillfull_ratio
  type: float
  level: advanced
  default: 0.9
  flags:
  - no_mon_update
  - cluster_create
  with_legacy: true

# The cluster’s health state will go WARN if any OSD exceeds this fullness
- name: mon_osd_nearfull_ratio
  type: float
  level: advanced
  desc: nearfull ratio for OSDs to be set during initial creation of cluster
  default: 0.85
  flags:
  - no_mon_update
  - cluster_create

Recent Ceph releases enforce that these values stay in order: the nearfull ratio lowest, then backfillfull, then full.  Having the nearfull ratio be 99% when the full ratio is 95%, for example, would make no sense.
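How the three thresholds stack up, as a hedged sketch.  The default values are the ones from the excerpt above; the function names and the exact comparison operators are my own assumptions for illustration, not Ceph's code.

```python
# Defaults as quoted from global.yaml.in above.
NEARFULL, BACKFILLFULL, FULL = 0.85, 0.90, 0.95


def ratios_in_order(nearfull, backfillfull, full):
    """Sanity check: nearfull, then backfillfull, then full, all within (0, 1]."""
    return 0.0 < nearfull <= backfillfull <= full <= 1.0


def osd_fullness_state(used_fraction, nearfull=NEARFULL,
                       backfillfull=BACKFILLFULL, full=FULL):
    """Return which threshold, if any, an OSD's utilization has exceeded."""
    if not ratios_in_order(nearfull, backfillfull, full):
        raise ValueError("ratios must satisfy nearfull <= backfillfull <= full")
    if used_fraction > full:
        return "full"          # writes will fail
    if used_fraction > backfillfull:
        return "backfillfull"  # OSD refuses to accept backfill
    if used_fraction > nearfull:
        return "nearfull"      # cluster health goes WARN
    return "ok"


if __name__ == "__main__":
    for used in (0.50, 0.87, 0.92, 0.96):
        print("%.0f%% used -> %s" % (100 * used, osd_fullness_state(used)))
```

The gap between nearfull and full is effectively the headroom left for healing: client writes keep working while backfill has room to land on the remaining OSDs.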
