Re: crushmap rules :: host selection



> First of all, thanks a lot for the info and for taking the time to help
> a beginner :)

Don't mention it.  This is a community, it's what we do.  Next year you'll help someone else.

> Oh! so the device class is more like an arbitrary label, not an immutable defined property!
> looking at
> this is not specified ...

Yes.  I verified this in the sources a couple of years ago.  I thought I added something to the docs about that; I'll take a look at the above.

Notably, for whatever reason Ceph does not automatically distinguish between SATA, SAS, and PCIe (NVMe) SSDs, so operators sometimes change the class of NVMe SSDs to "nvme".  We see classes like "fast-ssd" and "slow-ssd"; in our RGW work with Lua a couple of years ago we set "qlc", since one would not want to mix those with TLC SSDs.  People with lots of money might have set "optane".
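For example, reclassifying an OSD looks roughly like this (osd.12 here is just a placeholder; the existing class has to be removed before a new one can be set):

    ceph osd crush rm-device-class osd.12
    ceph osd crush set-device-class nvme osd.12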

> So i can create arbitrary sets of OSDs on which a crush rule will be set
> and when that crush rule will be applied on a pool, this will actually tie
> the tagged OSDs (with the arbitrary class name) to pool.. did i get it right?

Yes.  But this is not something most people ever need or want to do.  If you feel compelled to do so, there’s probably a better approach.  

Note that multiple pools can share OSDs and often do.  
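If you do go down that road, the usual shape of it is a class-aware CRUSH rule assigned to the pool, something like this (the rule and pool names are just examples):

    ceph osd crush rule create-replicated nvme-only default host nvme
    ceph osd pool set mypool crush_rule nvme-only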

> so it depends on failure domain .. but with host failure domain, if there is space on some other OSDs
> will the missing OSDs be "healed" on the available space on some other OSDs?

Yes, if you have enough hosts.  When using 3x replication it is thus advantageous to have at least 4 hosts.  Remember that the placement granularity is the PG, not the OSD.  Each placement group is placed independently.  Think of CRUSH as a hash function: when an OSD or host is down, the topology changes, and since the topology is an input to the function, the resulting output (the placement mappings) changes.  So Ceph converges the running state to match what's expected.  I'm kinda surprised at how much other software still needs to be managed manually.  Ceph spoils us.

For maintenance one can temporarily set flags that tell Ceph not to do this healing.  So when rebooting to activate a new kernel, replacing a DIMM, etc., the host will be right back, and you don't want to bother rebalancing / healing.
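The usual flag for this is noout: set it before the maintenance window, clear it afterwards:

    ceph osd set noout
    # reboot / replace the DIMM / etc.
    ceph osd unset noout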

> also, what will happen with the old ones? i ask for 2 main scenarios:
> 1. 1 machine breaks : the drives are ok, let's say it's just and power distributor problem and then
> the machine is put online after repair .. what will happen with the data on the OSDs?

The OSDs will come “up” and “in” (hopefully), which is a topology change, so the cluster rebalances.  Each PG peers, and only the data that changed while the OSDs were down is updated, which we call “recovery”.   

If a host is down for a very long time, it can be faster to wipe the OSDs and repopulate them in toto, but that’s an optimization you don’t need to worry about anytime soon.  
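You can watch peering and recovery progress with the usual status commands:

    ceph -s
    ceph pg stat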

> 2. a drive breaks : it is replaced, the drive is prepared and added with the same OSD number

The OSD ID may or may not be the same, depending on various factors.  But don’t worry about that at this stage.  

> (as it is replaced)
> presumably the data was already replicated/healed : what will happen with the OSD that now is empty?
> will it be detected as replaced and just used?

See above.  Remember that this isn’t dumb RAID.  Ceph will more or less balance usage across available OSDs.   
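A quick way to see how usage is spread across OSDs:

    ceph osd df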

>>> I'm thinking about a 3 node cluster with the replica=2
>>> failure domain = host, in such a way if one node is down, the data
>>> from there to be replicated on the remaining nodes
>> If one node is down the PGs will remain undersized because OSDs must be on disjoint hosts. Oh wait, you wrote size=2.
> and will the healing (rebuilding/resilvering) process start immediately ? or after some time?

Pretty much immediately.  Ceph is fanatical about strong consistency.  

>>  Don’t do that.  You will be likely to eventually lose data.  Use size=3 min_size=2.  If one node is down the PGs will be undersized but active.
> hmm .. that means that there is a mechanism that i do not understand :)
> with RAID1 with 2 devices

That’s a bad idea for some of the same reasons. I was doing RAID1 across three devices years ago.  I think I even got HP to add that ability to their RoC HBA firmware.  

When you only keep two copies of data, sooner or later you'll experience overlapping failures and you'll lose data.  Additionally, in a scale-out distributed system like Ceph there are certain sequences of events that result in not having even one replica of the data that is known to be complete and up to date.

I have personally experienced all of the above.  The danger is real.  
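For the record, setting that on an existing pool is just (the pool name is an example):

    ceph osd pool set mypool size 3
    ceph osd pool set mypool min_size 2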

> , if one is down when the replacement is added it will take ~seq speed to rebuild the mirror
> so for 22 TB

Ugh, HDDs for the lose.  

> usually is ~40 hours
> what is different, and what are the potential problems that can explode to data loss?

HDD dirty little secret # 437:  slow healing means an extended period of risky degraded redundancy.  Especially if you get stuck with SMR drives.  

I would expect much more than 40 hours for a 22 TB spinner, because Ceph tries to limit recovery speed so that clients aren't DoSed.  Erasure coding exacerbates this situation.  I've seen an 8 TB OSD take 4 weeks to backfill when throttled enough to allow client traffic.  HDDs have slow, narrow interfaces with rotational / seek latency and thus are a false economy.
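That throttling is governed mostly by options like these; the values below are only illustrative, not recommendations:

    ceph config set osd osd_max_backfills 1
    ceph config set osd osd_recovery_max_active 3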

>>> (with some drives kept as spares..)
>> This isn’t crummy RAID ;).  You generally deploy OSDs on all drives and let Ceph grow new replicas to heal if it needs to.
> ok.. but is there some kind of space reservation mechanism that would allow that spare space to be used only
> when pool needs healing?

Spare space is unused space.  There is no distinction.  Unlike many RAID implementations Ceph does not resilver blindly at drive granularity.  

Ceph has (adjustable) ratios for OSD fullness.  You generally want to maintain enough unused space to allow healing when drives fail.  If one is used to embedded or even software RAID, this idea can take some time to get.  It makes Ceph WAY more flexible, e.g. you don't have to maintain exact-size spare drives.  Though there are advantages to not having huge variation in drive sizes.
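You can inspect and adjust the fullness ratios at runtime; the values shown below are the usual defaults, just for illustration:

    ceph osd dump | grep ratio
    ceph osd set-nearfull-ratio 0.85
    ceph osd set-backfillfull-ratio 0.90
    ceph osd set-full-ratio 0.95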

>>> I am almost certain that from the point of view of ceph, what i'm thinking is wrong
>>> so i would love to receive some advice :)
>> Learning Ceph - Second Edition: Unified, scalable, and reliable open source storage solution
> ooh!! great, thanks a lot for info! :)

Best book about Ceph ever written ;)

>> ;)
>> Some nuances have changed since publication but the fundamentals are still fundamental.
>> Welcome to Ceph — Ceph Documentation
>> There is work underway to add a beginner’s guide.  Until then, I suggest search engines, this list, and the first four chapters of the above.
> Yup, i will do that :)
> Thanks a lot for help!
> Adrian
>>> Thanks a lot!
>>> Adrian
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx
