Re: Preferred order of operations when changing crush map and pool rules


 



I've not undertaken such a large data movement myself.

The pg-upmap script may be of use here, but I'll assume that it's not an option.

But if I were, I would first take many backups of the current crush map.
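
Something along these lines, assuming you want both the binary map and a decompiled text copy on hand (the file names are just examples):

    # grab the current crush map and a human-readable version of it
    ceph osd getcrushmap -o crushmap.backup.bin
    crushtool -d crushmap.backup.bin -o crushmap.backup.txt
    # a copy of the osd tree doesn't hurt either
    ceph osd tree > osd-tree.backup.txt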
I would set the norebalance and norecover flags.
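
i.e. something like the following (you may also want noout if OSDs could restart during the work, but that's a judgment call):

    ceph osd set norebalance
    ceph osd set norecover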
Then I would verify all of the backfill settings are as aggressive as you expect them to be.
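
On Nautilus you can check the values that are actually in effect with something like this (the option names below are the stock ones, verify against your own config):

    ceph config get osd osd_max_backfills
    ceph config get osd osd_recovery_max_active
    # or look at what a running daemon is really using
    ceph config show osd.0 | grep -E 'backfill|recovery'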
Then I would make the crush changes, which will then trigger backfill waits.
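
As a rough, untested sketch of what the crush changes could look like via the CLI, using Thomas' pod/rack names and assuming the root bucket is called 'default' (you could equally decompile, edit and recompile the map with crushtool and inject it in one go):

    # create the pod buckets and hang them off the root
    ceph osd crush add-bucket pod1 pod
    ceph osd crush move pod1 root=default
    # move the existing rack buckets under their pod (repeat for pod2/pod3)
    ceph osd crush move rack2 pod=pod1
    ceph osd crush move rack5 pod=pod1
    # new replicated rule with pod as the failure domain
    ceph osd crush rule create-replicated replicated_pod default pod
    # point each pool at the new rule
    ceph osd pool set <pool> crush_rule replicated_pod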
After verifying everything is as you expect, unset the flags, and let ceph do its thing.
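
i.e.:

    ceph osd unset norebalance
    ceph osd unset norecover
    # and keep an eye on progress
    ceph -s
    ceph osd df tree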

And of course, tweak the backfill/recovery settings as needed to speed things up or lighten the load.
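
For example, on Nautilus, something in this direction (the values are just a starting point, raise or lower them to taste and watch client latency):

    ceph config set osd osd_max_backfills 2
    ceph config set osd osd_recovery_max_active 3
    ceph config set osd osd_recovery_sleep_hdd 0.1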

Hope that's helpful.

Reed

> On Mar 30, 2021, at 8:00 AM, Thomas Hukkelberg <thomas@xxxxxxxxxxxxxxxxx> wrote:
> 
> Hi all!
> 
> We run a 1.5PB cluster with 12 hosts, 192 OSDs (a mix of NVMe and HDD) and need to improve our failure domain by altering the crush rules and moving the racks into pods, which would imply a lot of data movement.
> 
> I wonder what the preferred order of operations would be when making such changes to the crush map and pools? Will there be minimal data movement if we move all racks into pods at once and then change the pool replication rules, or is the better approach to first move the racks into pods one by one and then change the pool replication rules from rack to pod? Anyhow, I guess it's good practice to set 'norebalance' before moving hosts and unset it to start the actual movement?
> 
> Right now we have the following setup:
> 
> root -> rack2 -> ups1 + node51 + node57 + switch21
> root -> rack3 -> ups2 + node52 + node58 + switch22
> root -> rack4 -> ups3 + node53 + node59 + switch23
> root -> rack5 -> ups4 + node54 + node60 -- switch 21 ^^
> root -> rack6 -> ups5 + node55 + node61 -- switch 22 ^^
> root -> rack7 -> ups6 + node56 + node62 -- switch 23 ^^
> 
> Note that racks 5-7 are connected to the same ToR switches as racks 2-4. The cluster and frontend networks are in different VXLANs connected with dual 40GbE. The failure domain for 3x replicated pools is currently rack, and after adding hosts 57-62 we realized that if one of the switches reboots or fails, replicated PGs located only on the 4 hosts behind that switch will be unavailable and force pools offline. I guess it would instead be best to organize the racks into pods like this:
> 
> root -> pod1 -> rack2 -> ups1 + node51 + node57
> root -> pod1 -> rack5 -> ups4 + node54 + node60 -> switch21
> root -> pod2 -> rack3 -> ups2 + node52 + node58
> root -> pod2 -> rack6 -> ups5 + node55 + node61 -> switch 22
> root -> pod3 -> rack4 -> ups3 + node53 + node59
> root -> pod3 -> rack7 -> ups6 + node56 + node62 -> switch 23
> 
> The reason for this arrangement is that we plan to place the pods in different buildings in the future. We're running Nautilus 14.2.16 and are about to upgrade to Octopus. Should we upgrade to Octopus before making the crush changes?
> 
> Any thoughts or insight on how to achieve this with minimal data movement and risk of cluster downtime would be welcome!
> 
> 
> --thomas
> 
> --
> Thomas Hukkelberg
> thomas@xxxxxxxxxxxxxxxxx
> _______________________________________________
> ceph-users mailing list -- ceph-users@xxxxxxx
> To unsubscribe send an email to ceph-users-leave@xxxxxxx


