Re: crushmap rules :: host selection



-------- Original Message --------
Subject:  crushmap rules :: host selection
From: Anthony D'Atri
To: Adrian Sevcenco
Date: 1/28/2024, 3:56:21 AM

First of all, thanks a lot for the info and for taking the time to help
a beginner :)

Pools are a logical name for a storage space, but how can i specify
which OSDs or hosts are part of a pool?

Every pool specifies a CRUSH rule, which does this.  By default all OSDs will be used.  You can specify all, SSDs only, HDDs only, etc.  With a custom OSD device class you can select arbitrary hosts or OSDs.
Oh! so the device class is more like an arbitrary label, not an immutable defined property!
looking at
this is not specified ...
So i can create an arbitrary set of OSDs to which a crush rule will refer,
and when that crush rule is applied to a pool, this will actually tie
the tagged OSDs (with the arbitrary class name) to the pool .. did i get it right?
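If I got it right, that workflow with the standard ceph CLI would look roughly like the sketch below (the class, rule, and pool names are just placeholders):

```shell
# Tag some OSDs with an arbitrary device class
# (clear any auto-detected class first, or set-device-class will refuse)
ceph osd crush rm-device-class osd.0 osd.1 osd.2
ceph osd crush set-device-class mylabel osd.0 osd.1 osd.2

# Create a replicated CRUSH rule that only selects OSDs of that class,
# with host as the failure domain
ceph osd crush rule create-replicated myrule default host mylabel

# Point a pool at the rule; from then on its PGs map only to the tagged OSDs
ceph osd pool set mypool crush_rule myrule
```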

For replication, how can i specify: if a replica is missing (for a given time)
start rebuilding on some available OSD?


Is there a notion of "spare" so if an osd is missing on action, the rebuild to
start on another host and when the old OSD is back (the hdd is replaced, or the
machine was repaired) to be automatically cleaned up and used?

An available OSD will be selected to heal each PG according to the constraints in the pool’s CRUSH rule.  By default no PG will use more than one OSD on the same host.  It is common to use racks instead as the failure domain if the cluster is large enough and spread accordingly.
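For example, a rule with rack as the failure domain could be created like this (assuming the CRUSH map already has, or is given, rack buckets; the names are placeholders):

```shell
# Replicated rule that places each replica in a different rack
ceph osd crush rule create-replicated rack-rule default rack

# Move a host under a rack bucket if it is not already there
ceph osd crush move host1 rack=rack1 root=default
```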
so it depends on the failure domain .. but with host failure domain, if there is space on some other OSDs,
will the missing OSD's data be "healed" onto the available space on those other OSDs?
also, what will happen with the old ones? i ask for 2 main scenarios:
1. a machine breaks : the drives are ok, let's say it's just a power distribution problem, and
the machine is put back online after repair .. what will happen with the data on its OSDs?

2. a drive breaks : it is replaced, the new drive is prepared and added with the same OSD number (as a replacement);
presumably the data was already replicated/healed : what will happen with the OSD that is now empty?
will it be detected as replaced and just used?
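For the second scenario, the usual replacement flow that keeps the OSD id looks roughly like this (the id and device path are placeholders):

```shell
# Mark the dead OSD destroyed but keep its id and CRUSH position
ceph osd destroy 7 --yes-i-really-mean-it

# Prepare the new drive, reusing the freed id; it joins empty and
# backfill then moves data back onto it
ceph-volume lvm create --osd-id 7 --data /dev/sdx
```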

I'm thinking about a 3 node cluster with replica=2 and
failure domain = host, in such a way that if one node is down, the data
from there gets replicated onto the remaining nodes

If one node is down the PGs will remain undersized because OSDs must be on disjoint hosts. Oh wait, you wrote size=2.
and will the healing (rebuilding/resilvering) process start immediately ? or after some time?

 Don’t do that.  You are likely to eventually lose data.  Use size=3 min_size=2.  If one node is down the PGs will be undersized but active.
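Assuming a pool named mypool, the suggested settings would be applied as:

```shell
# Keep three replicas; keep serving I/O as long as at least two are up
ceph osd pool set mypool size 3
ceph osd pool set mypool min_size 2
```

The reason size=2 min_size=1 is dangerous: while one replica is down, writes land on a single copy, and losing that one OSD before recovery completes loses those writes for good.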
hmm .. that means that there is a mechanism that i do not understand :)
with RAID1 with 2 devices, if one is down, when the replacement is added it will take ~sequential speed to rebuild the mirror,
so for 22 TB that is usually ~40 hours
what is different, and what are the potential problems that can explode into data loss?
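On the earlier timing question: by default Ceph waits mon_osd_down_out_interval (600 seconds) before marking a down OSD out, and only then does backfill to other OSDs begin. That interval, and the noout flag for planned maintenance, are the knobs:

```shell
# Show / change how long Ceph waits before marking a down OSD "out"
ceph config get mon mon_osd_down_out_interval
ceph config set mon mon_osd_down_out_interval 600

# Prevent automatic out-marking during planned maintenance
ceph osd set noout
ceph osd unset noout
```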

(with some drives kept as spares..)

This isn’t crummy RAID ;).  You generally deploy OSDs on all drives and let Ceph grow new replicas to heal if it needs to.
ok.. but is there some kind of space reservation mechanism that would allow that spare space to be used only
when a pool needs healing?

I am almost certain that from the point of view of ceph, what i'm thinking is wrong
so i would love to receive some advice :)

Learning Ceph - Second Edition: Unified, scalable, and reliable open source storage solution
ooh!! great, thanks a lot for info! :)


Some nuances have changed since publication but the fundamentals are still fundamental.

Welcome to Ceph — Ceph Documentation


There is work underway to add a beginner’s guide.  Until then, I suggest search engines, this list, and the first four chapters of the above.
Yup, i will do that :)

Thanks a lot for the help!

Thanks a lot!
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx

