Re: crushmap rules :: host selection

Adrian Sevcenco <Adrian.Sevcenco@xxxxxxx> · Sun, 28 Jan 2024 22:10:50 +0200

-------- Original Message --------
Subject:  Re: crushmap rules :: host selection
From: Anthony D'Atri
To: Adrian Sevcenco
Date: 1/28/2024, 6:03:21 PM

First of all, thanks a lot for the info and for taking the time to help
a beginner :)

Don’t mention it.  This is a community, it’s what we do.   Next year you’ll help someone else.
:)

Oh! so the device class is more like an arbitrary label, not an immutable, predefined property!
looking at https://docs.ceph.com/en/reef/rados/operations/crush-map/#device-classes
this is not specified ...

Yes.  I verified this in the sources a couple of years ago.   I thought I added something to docs about that, I’ll take a look at the above.
i see, thanks!

For whatever reason I think Ceph does not automatically distinguish between SATA, SAS, and PCI (NVMe) SSDs, notably.  So operators sometimes change the class of NVMe SSDs to “nvme”.  We see classes like “fast-ssd” and “slow-ssd”; in our RGW work a couple years ago with Lua we set “qlc” since one would not want to mix those with TLC SSDs.  People with lots of money might have set “optane”.

So i can create arbitrary sets of OSDs that a crush rule will select,
and when that crush rule is applied to a pool, this will actually tie
the tagged OSDs (with the arbitrary class name) to the pool.. did i get it right?

Yes.  But this is not something most people ever need or want to do.  If you feel compelled to do so, there’s probably a better approach.

Note that multiple pools can share OSDs and often do.
well, in cases where the users are teams with a "this is my toy" attitude and one tries to unify the different "toys" (and funding resources)
under a common administrative umbrella (while keeping a clear distinction of the hardware that is used only by its owners)
i might have to do this .. but for now i am only gathering information to make my case for a funding request
for a pilot ceph installation ..
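
For reference, the chain confirmed above (custom device class, then a CRUSH rule restricted to that class, then a pool using that rule) could be sketched roughly as follows. This is only a sketch that builds the `ceph` command lines without running them; the class name `teamA`, the rule and pool names, and the OSD ids are hypothetical, and the root `default` is an assumption about the cluster layout.

```python
def device_class_commands(osd_ids, device_class, rule_name, pool,
                          root="default", failure_domain="host"):
    """Build (not run) the ceph CLI calls that tie a pool to a custom device class."""
    cmds = []
    for osd_id in osd_ids:
        osd = f"osd.{osd_id}"
        # An OSD that already has a class needs it removed before a new one is set.
        cmds.append(["ceph", "osd", "crush", "rm-device-class", osd])
        cmds.append(["ceph", "osd", "crush", "set-device-class", device_class, osd])
    # Replicated rule that only selects OSDs carrying that class.
    cmds.append(["ceph", "osd", "crush", "rule", "create-replicated",
                 rule_name, root, failure_domain, device_class])
    # Point the pool at the new rule.
    cmds.append(["ceph", "osd", "pool", "set", pool, "crush_rule", rule_name])
    return cmds


if __name__ == "__main__":
    # Hypothetical names, printed for review rather than executed.
    for cmd in device_class_commands([3, 7], "teamA", "teamA-rule", "teamA-pool"):
        print(" ".join(cmd))
```

Printing the commands first makes it easy to review them before anything touches a real cluster.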

so it depends on failure domain .. but with host failure domain, if there is space on some other OSDs
will the missing OSDs be "healed" on the available space on some other OSDs?

Yes, if you have enough hosts.  When using 3x replication it is thus advantageous to have at least 4 hosts.
yeah, this is the reason i was thinking of 2x replication, as i was planning to request 3 hosts

Remember that the placement granularity is the PG, not the OSD.   Each placement group is placed independently.   Think of CRUSH as a hash function.
yeah, i keep remembering this and then i forget :)
is it wrong to think of PGs as a kind of object bucket (S3-like)?
if so, is the size of a PG the limit on file size (for the replicated case)?
if i have 20 PGs of 10 GB each, uniformly 50% occupied, will i get a "no space" error
if i try to write a 6 GB file?

When an OSD or host is down, the topology changes, which is an input to the function, so the resulting output - the placement mappings- change.  So Ceph converges the running state to match what’s expected.  I’m kinda surprised at how much other software still needs to be managed manually.  Ceph spoils us.
i see
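
The "hash function with the topology as an input" idea can be illustrated with a toy sketch. To be clear, this is rendezvous (highest-random-weight) hashing, not the real CRUSH algorithm, and the host names and PG count are made up; it only shows how each PG is placed independently and how removing a host changes the output for the PGs that used it.

```python
import hashlib


def place_pg(pg_id, hosts, size=3):
    """Pick `size` distinct hosts for a PG by ranking hosts on a per-(PG, host) hash."""
    def score(host):
        digest = hashlib.sha256(f"{pg_id}:{host}".encode()).digest()
        return int.from_bytes(digest[:8], "big")
    return sorted(hosts, key=score, reverse=True)[:size]


def mapping(hosts, pg_count=16, size=3):
    """Placement of every PG for a given set of hosts (the 'topology')."""
    return {pg: place_pg(pg, hosts, size) for pg in range(pg_count)}


if __name__ == "__main__":
    hosts = ["host-a", "host-b", "host-c", "host-d"]
    before = mapping(hosts)
    after = mapping([h for h in hosts if h != "host-b"])  # host-b goes down
    for pg in before:
        note = "remapped" if before[pg] != after[pg] else "unchanged"
        print(pg, before[pg], "->", after[pg], note)
```

Only the PGs that had a replica on the missing host get a new mapping; the rest are untouched, which is why a fourth host gives 3x replication somewhere to heal to.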

For maintenance one can temporarily set flags that tell Ceph to not do this healing.  So when rebooting to activate a new kernel, replace a DIMM, etc, the host will be right back, so you don’t want to bother rebalancing / healing.
i will have to keep this in mind .. when i will have the actual hardware to experiment on i will get
back here with questions :)
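
As a rough illustration of the maintenance flags mentioned above: `noout` is the flag commonly used for this, set with `ceph osd set noout` and cleared with `ceph osd unset noout`. The wrapper below is a hypothetical helper, not part of Ceph, and the example runs it in a dry-run mode that only records the commands.

```python
from contextlib import contextmanager
import subprocess


@contextmanager
def maintenance_window(flags=("noout",), run=subprocess.run):
    """Set cluster flags for the duration of a maintenance task, then unset them."""
    for flag in flags:
        run(["ceph", "osd", "set", flag], check=True)
    try:
        yield
    finally:
        # Always clear the flags, even if the maintenance step raised.
        for flag in reversed(flags):
            run(["ceph", "osd", "unset", flag], check=True)


if __name__ == "__main__":
    # Dry run: record the commands instead of talking to a real cluster.
    issued = []
    with maintenance_window(run=lambda cmd, check: issued.append(cmd)):
        issued.append(["# reboot the host here"])
    for cmd in issued:
        print(" ".join(cmd))
```

Forgetting to unset the flag afterwards would leave the cluster unable to heal after a real failure, hence the `finally`.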

also, what will happen with the old ones? i ask for 2 main scenarios:
1. 1 machine breaks : the drives are ok, let's say it's just a power distribution problem and then
the machine is put back online after repair .. what will happen with the data on the OSDs?

The OSDs will come “up” and “in” (hopefully), which is a topology change, so the cluster rebalances.  Each PG peers, and only the data that changed while the OSDs were down is updated, which we call “recovery”.
cool!

If a host is down for a very long time, it can be faster to wipe the OSDs and repopulate them in toto, but that’s an optimization you don’t need to worry about anytime soon.
ok, i will keep this in mind and ask about the exact procedure when the questions will not be hypothetical :)

2. a drive breaks : it is replaced, the drive is prepared and added with the same OSD number

The OSD ID may or may not be the same, depending on various factors.  But don’t worry about that at this stage.
got it

(as it is replaced)
presumably the data was already replicated/healed : what will happen with the OSD that now is empty?
will it be detected as replaced and just used?

See above.  Remember that this isn’t dumb RAID.  Ceph will more or less balance usage across available OSDs.
yeah.. it will be a tough change of habits :)

I'm thinking about a 3 node cluster with the replica=2
failure domain = host, in such a way if one node is down, the data
from there to be replicated on the remaining nodes
If one node is down the PGs will remain undersized because OSDs must be on disjoint hosts. Oh wait, you wrote size=2.
and will the healing (rebuilding/resilvering) process start immediately ? or after some time?

Pretty much immediately.  Ceph is fanatical about strong consistency.
isn't there any time delay? maybe configurable?
for example if some technician pulls the wrong network cable for a few seconds
i wouldn't like the healing to start on the microsecond that something went wrong

  Don’t do that.  You will be likely to eventually lose data.  Use size=3 min_size=2.  If one node is down the PGs will be undersized but active.
hmm .. that means that there is a mechanism that i do not understand :)
with RAID1 with 2 devices

That’s a bad idea for some of the same reasons. I was doing RAID1 across three devices years ago.  I think I even got HP to add that ability to their RoC HBA firmware.

When you only keep two copies of data, sooner or later you’ll experience overlapping failures and you’ll lose data.  Additionally in a scale-out distributed system like Ceph there are certain sequences of events that result in not having even one replica of data that is known to be complete and up to date.

I have personally experienced all of the above.  The danger is real.
hmm .. got it, 4 hosts, 3 replicas, min_size=2 it is then :)

, if one is down when the replacement is added it will take ~seq speed to rebuild the mirror
so for 22 TB

Ugh, HDDs for the lose.
well, i have two goals :
1. replacing the multiple VM images that i have replicated on different hosts
(multiple for the same VM!! .. it's a mess) and having an RBD pool for these
2. using CephFS to replace a few NFS servers that lately hit a performance ceiling
and gave me some serious headaches with stability

so, while for (1) i plan to use some NVMe SSDs, for (2) there will be HDDs
(with the plan that, once the pilot is running, i somehow move the drives
from the NFS servers to ceph in an N+2 erasure-coded way, similar to the RAID6 that i have now,
and with an OSD-level failure domain)
but first i have to get the funding and hardware for the pilot installation :)

usually is ~40 hours
what is different, and what are the potential problems that can explode to data loss?

HDD dirty little secret # 437:  slow healing means an extended period of risky degraded redundancy.  Especially if you get stuck with SMR drives.
i do not use SMR, i have only normal CMR drives, and this 30-40 hours is the 10+2 RAID6 rebuild time
and yes, there is always the possibility of losing 2 drives in a partition, so i have to take the RAID out of production
and really hope that nothing will happen until i get to the data center to change the drives :D

I would expect much more than 40 hours for a 22TB spinner - Ceph tries to limit recovery speed so that clients aren’t DoSed.   Erasure coding exacerbates this situation.  I’ve seen an 8TB OSD take 4 weeks to backfill when throttled enough to allow client traffic.  HDDs have slow, narrow interfaces with rotational / seek latency and thus are a false economy.
well, then this is worrisome for me .. as i said above, in a 12-disk RAID6 array, the rebuild time for 1 disk is under
40 hours (but it's true that i make the partition RO so no additional writes hammer the drives, only reads)
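
To put the numbers in this exchange side by side, here is a back-of-the-envelope sketch. It assumes full drives and decimal units (1 TB = 10^12 bytes); the 50 MB/s throttled rate in the last line is purely hypothetical. A 22 TB rebuild in 40 hours implies roughly 153 MB/s sustained, while 8 TB in 4 weeks works out to about 3.3 MB/s.

```python
def implied_rate_mb_s(capacity_tb, hours):
    """Average MB/s implied by moving capacity_tb (decimal TB) in `hours`."""
    return capacity_tb * 1e12 / (hours * 3600) / 1e6


def hours_to_copy(capacity_tb, mb_per_s):
    """Hours needed to move capacity_tb (decimal TB) at a sustained MB/s rate."""
    return capacity_tb * 1e12 / (mb_per_s * 1e6) / 3600


if __name__ == "__main__":
    print(round(implied_rate_mb_s(22, 40), 1))          # 22 TB RAID rebuild in 40 hours
    print(round(implied_rate_mb_s(8, 4 * 7 * 24), 1))   # 8 TB backfill in 4 weeks
    print(round(hours_to_copy(22, 50), 1))              # 22 TB at a hypothetical 50 MB/s
```

The gap between the two implied rates is the point being made: a read-only RAID rebuild can run near sequential speed, while throttled recovery alongside client traffic may not.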

(with some drives kept as spares..)
This isn’t crummy RAID ;).  You generally deploy OSDs on all drives and let Ceph grow new replicas to heal if it needs to.
ok.. but is there some kind of space reservation mechanism that would allow that spare space to be used only
when pool needs healing?

Spare space is unused space.  There is no distinction.  Unlike many RAID implementations Ceph does not resilver blindly at drive granularity.
i see, got it

Ceph has (adjustable) ratios for OSD fullness.  You generally want to maintain enough unused space to allow healing when drives fail.  If one is used to embedded or even software RAID this idea can take some time to get.  It makes Ceph WAY more flexible, e.g. you don’t have to maintain exact-size spare drives.  Though there are advantages to not having huge variation in size.
oh, that means that you can do something like "use at most 90% of the space for normal operations, but if
there is healing/rebalancing, use the remaining 10% until a new OSD is in place and rebalancing frees that 10%" ?
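
The "keep enough unused space to allow healing" advice can be turned into a rough rule of thumb. The sketch below assumes equal-capacity hosts, perfectly even data distribution, and a host-level failure domain, none of which hold exactly in practice; the 0.85 target corresponds to Ceph's default nearfull ratio (backfillfull and full default to 0.90 and 0.95).

```python
def max_safe_fill(hosts, target_ratio=0.85):
    """Highest average fill before a host failure such that, once its data is
    re-created on the remaining hosts, they stay at or below target_ratio.

    Assumes equal-capacity hosts and perfectly even data distribution.
    """
    if hosts < 2:
        raise ValueError("need at least two hosts to heal onto")
    return target_ratio * (hosts - 1) / hosts


if __name__ == "__main__":
    for n in (4, 6, 12):
        print(n, "hosts ->", f"{max_safe_fill(n):.1%}")
```

With few hosts the headroom needed is large: on 4 hosts, losing one moves its share of the data onto the other 3, so the cluster has to stay well below the nearfull threshold in normal operation.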

I am almost certain that from the point of view of ceph, what i'm thinking is wrong
so i would love to receive some advice :)
Learning Ceph - Second Edition: Unified, scalable, and reliable open source storage solution
https://a.co/d/9AwlerS
ooh!! great, thanks a lot for info! :)

Best book about Ceph ever written ;)
Ha! I will make sure to verify the claim :)))

Thanks a lot!!
Adrian
