On Thu, 21 Nov 2024 at 19:18, Andre Tann <atann@xxxxxxxxxxxx> wrote:
> > This post seems to show that, except they have their root named
> > "nvme" and they split on rack and not dc, but that is not important.
> >
> > https://unix.stackexchange.com/questions/781250/ceph-crush-rules-explanation-for-multiroom-racks-setup
>
> This is indeed a good example, thanks.
> Let me put some thoughts/questions here:
>
> > step choose firstn 2 type rack
>
> This chooses 2 racks out of all available racks. As there are 2 racks
> available, all are chosen.

Yes, and you would name it DC instead, of course.

> > step chooseleaf firstn 2 type host
>
> For each rack selected in the previous step, 2 hosts are chosen. But
> as the action is "chooseleaf", it is in fact not the hosts that are
> picked, but one random (?) OSD in each of the 2 selected hosts.

Well, it picks a leaf out of the host, which is a branch in the tree. I
see it as: after picking the host, don't do anything special, just grab
an OSD from there.

> In the end we have 4 OSDs in 4 different hosts, 2 in each rack.
> Is this understanding correct?

I believe so, yes. (A sketch of the complete rule is at the end of this
mail.)

> Shouldn't we note this one additionally:
>
>     min_size 4

Not necessary; you could allow min_size 3 so that single-drive problems
don't cause the PG to stop.

>     max_size 4
>
> Reason: If we wanted to place more or less than 4 replicas, the rule
> won't work. Or what would happen if we don't specify min/max_size?
> Should lead to an error in case the pool is e.g. size=5, shouldn't it?

Yes, but when you figure you need a repl=5 pool you would have to make a
rule that picks 3 from one DC. I'm sure there is a way to say "..and
then pick as many hosts as needed", but I don't know it offhand. It
might be that the above rule would allow 5 copies, but the fifth would
end up on the same host as one of the others.

> One last question: if we edit a crush map after a pool was created on
> it, what happens? In my understanding, this leads to massive data
> shifting so that the placements comply with the new rules. Is that
> right?

Yes, but it can be mitigated somewhat by using the remappers and letting
the balancer do the changes slowly:

1. set norebalance
2. stop the balancer
3. apply the new crush rule to the pool
4. let the mons figure out all the new places for the PGs
5. run one of the remapper tools (jj-balancer, upmap-remapper.py or the
   golang pgremapper), which makes most (sometimes all) PGs think they
   are in the correct place after all
6. unset norebalance
7. start the ceph balancer with a max misplaced % setting that suits
   the load you want to have during the moves

(The same steps as concrete commands are sketched at the end of this
mail.)

-- 
May the most significant bit of your life be positive.
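
For reference, a sketch of what the complete rule discussed above could
look like in a decompiled crush map. The rule name and id are made up,
and it assumes your two DCs are buckets of type "datacenter" under the
"default" root; adjust to your own tree (the stackexchange example uses
racks under an "nvme" root instead):

    rule replicated_2dc {
        id 10                                  # any unused rule id
        type replicated
        step take default                      # start at the root of the tree
        step choose firstn 2 type datacenter   # pick 2 DCs
        step chooseleaf firstn 2 type host     # in each DC, pick 2 hosts, one OSD per host
        step emit
    }

The usual round trip applies: ceph osd getcrushmap, crushtool -d to
decompile, edit, crushtool -c to recompile, ceph osd setcrushmap.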
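
And a rough sketch of the migration steps as commands. The pool name
"mypool", the rule name "replicated_2dc" and the 5% misplaced ratio are
placeholders, and the remapper invocation differs per tool, so check the
tool's own README:

    ceph osd set norebalance                             # 1. hold off data movement
    ceph balancer off                                    # 2. stop the balancer
    ceph osd pool set mypool crush_rule replicated_2dc   # 3. switch the pool to the new rule
    ceph -s                                              # 4. wait until the PGs show up as misplaced
    # 5. run one of the remappers so the PGs are upmapped back to where
    #    they currently are, e.g. upmap-remapper.py (invocation is an
    #    assumption, see its docs):
    # ./upmap-remapper.py | sh
    ceph osd unset norebalance                           # 6. allow rebalancing again
    ceph config set mgr target_max_misplaced_ratio 0.05  # 7. cap how much is moved at once...
    ceph balancer on                                     #    ...and let the balancer undo the upmaps over time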