Re: classes crush rules new cluster

Andre,

see responses inline.

Quoting Andre Tann <atann@xxxxxxxxxxxx>:

Hi Eugen,

On 29.11.24 at 11:31, Eugen Block wrote:

step set_chooseleaf_tries 5 -> stick to defaults, usually works (number of max attempts to find suitable OSDs)

Why do we need more than one attempt to find an OSD? Why is the result different if we walk through a rule more than once?

There have been cases with a large number of OSDs where crush "gave up too soon". Although I haven't read about that in quite a while, it may or may not still be an issue.
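For context, this is roughly where those tunables sit in an EC rule (the rule name and id below are just placeholders, not from your cluster):

  rule ec_example {
      id 2
      type erasure
      step set_chooseleaf_tries 5
      step set_choose_tries 100
      step take default
      step chooseleaf indep 0 type host
      step emit
  }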

step take default class test -> "default" is the usual default crush root (check 'ceph osd tree'); you can specify other roots if you have them

Where are these classes defined? Or is "default class test" the name of a root? Most probably not.

You define those classes. By default, Ceph creates a "default" entry point into the crush tree of type "root":

ceph osd tree | head -2
ID  CLASS  WEIGHT   TYPE NAME            STATUS  REWEIGHT  PRI-AFF
-1         0.14648  root default

You can create multiple roots with arbitrary names. Those roots can be addressed in crush rules. Before there were device classes, users split their trees into multiple roots, for example one for HDD, one for SSD devices.
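For example (the OSD ids and rule name here are made up), assigning a class and creating a rule restricted to it would look something like:

ceph osd crush rm-device-class osd.0 osd.1 osd.2   # only needed if a class is already set
ceph osd crush set-device-class test osd.0 osd.1 osd.2
ceph osd crush rule create-replicated repl_test default host test

The last command creates a replicated rule on root "default" with failure domain "host", restricted to device class "test".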

Could I also say step take default type host?

I haven't tried that, I would assume that the entry point still has to be a bucket of type "root". I encourage you to play around in a lab cluster to get familiar with crushmaps and especially the crushtool, you'll benefit from it.
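If it helps, the usual crushtool round trip looks like this (file names and the rule id are just placeholders):

ceph osd getcrushmap -o crushmap.bin
crushtool -d crushmap.bin -o crushmap.txt
# edit crushmap.txt, then recompile
crushtool -c crushmap.txt -o crushmap-new.bin
# dry-run a rule (here rule id 1 with 4 replicas) without touching the cluster
crushtool -i crushmap-new.bin --test --show-mappings --rule 1 --num-rep 4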

What are the keywords that are allowed after the root's name?

Fair question, I'm only aware of "class XYZ", so the device classes. I haven't checked in detail though.

step chooseleaf indep 0 type host -> within bucket "root" (from "step take default") choose {pool-num-replicas} hosts

What if I did exactly this, but have nested fault domains (e.g. racks > hosts)? Would the rule then pick {pool-num-replicas} hosts out of different racks, even though this rule doesn't mention racks anywhere?

Since I don't have racks in my lab cluster, I don't specify them. You need to modify your rule(s) according to your infrastructure, my example was just a simple one from one of my lab clusters.

But what if I have size=4 but only two racks? Would the picked hosts spread evenly across the two racks, or randomly, e.g. one host in one rack and three in the other, or all four in one rack?

You can (and most likely will) end up with the random result if you don't specifically tell crush what to do.
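For your size=4 / two racks example, a rule like this (name and id made up) should give you two hosts per rack, if I'm not mistaken:

  rule rack_even {
      id 3
      type replicated
      step take default
      step choose firstn 2 type rack
      step chooseleaf firstn 2 type host
      step emit
  }

But again, verify the resulting mappings with 'crushtool --test' before relying on it.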

Assume a pool with size=4, could I say

  step take default
  step choose firstn 1 type row
  step choose firstn 3 type rack
  step chooseleaf firstn 0 type host
  step emit

Meaning:
- force all chunks of a pg in one row
- force all chunks in exactly three racks inside this row
- out of these three racks, pick 4 hosts

I don't want to say that the latter makes much sense, I just wonder if it would work that way.

I think it would, but again, give it a try. You can create "virtual" rows and racks; just add the respective buckets to the crushmap (of your test cluster):

ceph osd crush add-bucket row1 row root=default
added bucket row1 type row to location {root=default}

ceph osd crush add-bucket rack1 rack row=row1
added bucket rack1 type rack to location {row=row1}

ceph osd crush add-bucket rack2 rack row=row1
added bucket rack2 type rack to location {row=row1}

ceph osd crush add-bucket rack3 rack row=row1
added bucket rack3 type rack to location {row=row1}

Then move some of your hosts into the racks with 'ceph osd crush move ...' and test your crush rules.
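For example (hypothetical host names):

ceph osd crush move host1 rack=rack1
ceph osd crush move host2 rack=rack2
ceph osd crush move host3 rack=rack3
ceph osd crush tree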



--
Andre Tann
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx

