On Thu, 21 Nov 2024 at 19:18, Andre Tann <atann@xxxxxxxxxxxx> wrote:
> > This post seems to show that, except they have their root named
> > "nvme" and they split on rack and not dc, but that is not important.
> >
> > https://unix.stackexchange.com/questions/781250/ceph-crush-rules-explanation-for-multiroom-racks-setup
>
> This is indeed a good example, thanks.
> Let me put some thoughts/questions here:
>
> > step choose firstn 2 type rack
>
> This chooses 2 racks out of all available racks. As there are 2 racks
> available, all are chosen.

Yes, and you would name it DC instead, of course.

> > step chooseleaf firstn 2 type host
>
> For each rack selected in the previous step, 2 hosts are chosen. But
> as the action is "chooseleaf", it is in fact not the hosts that are
> picked, but one random (?) OSD in each of the 2 selected hosts.

Well, it picks a leaf out of the host, which is a branch in the tree. I
see it as: after picking the host, don't do anything special, just grab
an OSD from there.

> In the end we have 4 OSDs in 4 different hosts, 2 in each rack.
> Is this understanding correct?

I believe so, yes. (A sketch of the complete rule is at the end of this
mail.)

> Shouldn't we note this one additionally:
>
>     min_size 4

Not necessary; you could allow min_size 3 so that single-drive problems
don't cause the PG to stop.

>     max_size 4
>
> Reason: If we wanted to place more or less than 4 replicas, the rule
> won't work. Or what would happen if we don't specify min/max_size?
> Should lead to an error in case the pool is e.g. size=5, shouldn't it?

Yes, but when you figure you need a repl=5 pool you would have to make a
rule that picks 3 from one DC. I'm sure there is a way to say "..and
then pick as many hosts as needed", but I don't know it offhand. It
might be that the above rule would allow 5 copies, but the fifth would
end up on the same host as one of the others.

> One last question: if we edit a crush map after a pool was created on
> it, what happens? In my understanding, this leads to massive data
> shifting so that the placements comply with the new rules. Is that
> right?

Yes, but it can be mitigated somewhat by using the remappers and letting
the balancer do the changes slowly:

1. set norebalance
2. stop the balancer
3. apply the new crush rule to the pool
4. let the mons figure out all the new places for the PGs
5. run one of the remapper tools (jj-balancer, upmap-remapper.py or the
   golang pgremapper), which makes most (sometimes all) PGs think they
   are in the correct place after all
6. unset norebalance
7. start the ceph balancer with a max misplaced % setting that suits
   the load you want to have during the moves

(The same steps as concrete commands are sketched at the end of this
mail.)

-- 
May the most significant bit of your life be positive.
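
For reference, a sketch of what the complete rule discussed above could
look like in a decompiled crush map. The rule name and id are made up,
and it assumes your two DCs are buckets of type "datacenter" under the
"default" root; adjust to your own tree (the stackexchange example uses
racks under an "nvme" root instead):

    rule replicated_2dc {
        id 10                                  # any unused rule id
        type replicated
        step take default                      # start at the root of the tree
        step choose firstn 2 type datacenter   # pick 2 DCs
        step chooseleaf firstn 2 type host     # in each DC, pick 2 hosts, one OSD per host
        step emit
    }

The usual round trip applies: ceph osd getcrushmap, crushtool -d to
decompile, edit, crushtool -c to recompile, ceph osd setcrushmap.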
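
And a rough sketch of the migration steps as commands. The pool name
"mypool", the rule name "replicated_2dc" and the 5% misplaced ratio are
placeholders, and the remapper invocation differs per tool, so check the
tool's own README:

    ceph osd set norebalance                             # 1. hold off data movement
    ceph balancer off                                    # 2. stop the balancer
    ceph osd pool set mypool crush_rule replicated_2dc   # 3. switch the pool to the new rule
    ceph -s                                              # 4. wait until the PGs show up as misplaced
    # 5. run one of the remappers so the PGs are upmapped back to where
    #    they currently are, e.g. upmap-remapper.py (invocation is an
    #    assumption, see its docs):
    # ./upmap-remapper.py | sh
    ceph osd unset norebalance                           # 6. allow rebalancing again
    ceph config set mgr target_max_misplaced_ratio 0.05  # 7. cap how much is moved at once...
    ceph balancer on                                     #    ...and let the balancer undo the upmaps over time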