Hi all,

Back at the April CDM [1] we proposed a sort of "osd crush freeze" tool
that would use upmap to let operators more easily phase in new tunables
or make large crush tree changes gradually.

Theo here at CERN has prototyped this:

  https://github.com/cernceph/ceph-scripts/blob/master/tools/upmap-newup-to-oldup.py

It already seems quite usable, but we'd like some feedback on how best
to proceed.

(When this idea was first proposed, it was more or less agreed that it
would be better to implement mutable health warnings, but this was so
easy (and, I think, useful...) that we went ahead anyway.)

Here's how it works:

1. The cluster is HEALTH_OK and we want to change the tunables from
hammer to optimal.

2. Run the script: it sets the norebalance/norecover/nobackfill flags
and snapshots the current PG mappings, then prompts you to make the big
crush change:

# ./upmap-newup-to-oldup.py
Snapshotting the PG state...
norebalance is set
norecover is set
nobackfill is set
Do the change, wait for ceph status to stabilize, then yes/no to continue or exit (yes/no or y/n):

3. We make the big crush change, in this case `ceph osd crush tunables
optimal`. On our test cluster this stabilized at:

    health: HEALTH_WARN
            nobackfill,norebalance,norecover flag(s) set
            23736/1291065 objects misplaced (1.838%)
            Degraded data redundancy: 812037/1291065 objects degraded (62.897%), 2794 pgs degraded

4. We type 'yes' to continue, and the tool upmaps (via the ceph CLI) the
PGs back to where they currently are. Here's some sample output:

Do the change, wait for ceph status to stabilize, then yes/no to continue or exit (yes/no or y/n): y
...
set 51.11f pg_upmap mapping to [211,82,106]
set 50.197 pg_upmap mapping to [66,102,207]
set 51.201 pg_upmap mapping to [337,22,233]
...
set 52.2ac pg_upmap mapping to [120,300,193]
set 52.25f pg_upmap mapping to [216,142,294]
norebalance is unset
norecover is unset
nobackfill is unset
#

5. The tool finishes after a couple of minutes. By the time the script
exits we have `health: HEALTH_OK`. Note that no actual data moved.

6. Now, at our leisure, we slowly remove the pg_upmap entries to
gradually bring the new tunables into full effect.

Questions:

1. We think this would be useful, most notably when upgrading tunables
or adding or removing entire hosts or racks. Does anyone else think
this would be useful?

2. Is there anything we're missing that should prevent us from
developing this idea further?

3. Is there any interest in this being added to Ceph proper, e.g. as a
mgr module? (Being internal to Ceph, we could apply the upmaps much
more quickly rather than relying on the slow CLI.)

4. We weren't sure whether we should use pg-upmap or pg-upmap-items
here. Can someone advise us?

Cheers,

Theo & Dan

[1] https://pad.ceph.com/p/cephalocon-usability-brainstorming
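
P.S. For anyone who would rather read code than prose, the mechanism
boils down to roughly the sketch below (a minimal sketch, not the
actual script linked above): snapshot the up set of every PG, set the
flags, wait for the operator to make the crush change, then pin each PG
that would move back to its old up set via pg-upmap, and finally unset
the flags. The JSON layout and field names ('pgid', 'up') assumed for
`ceph pg dump pgs_brief` may vary between releases.

#!/usr/bin/env python3
#
# Minimal sketch of the "crush freeze" mechanism described above
# (not the actual upmap-newup-to-oldup.py).

import json
import subprocess

def ceph(*args):
    """Run a ceph CLI command and return its stdout."""
    return subprocess.check_output(['ceph'] + list(args))

def pg_up_map():
    """Return a dict of pgid -> up set from the current PG map."""
    pgs = json.loads(ceph('pg', 'dump', 'pgs_brief', '--format', 'json'))
    # Some releases nest the PG list under 'pg_stats'; handle both layouts.
    if isinstance(pgs, dict):
        pgs = pgs.get('pg_stats', [])
    return {pg['pgid']: pg['up'] for pg in pgs}

def main():
    # 1. Snapshot the current mappings and stop data movement.
    old_up = pg_up_map()
    for flag in ('norebalance', 'norecover', 'nobackfill'):
        ceph('osd', 'set', flag)

    # 2. The operator makes the big crush change out of band.
    input('Do the change, wait for ceph status to stabilize, '
          'then press enter to continue: ')

    # 3. Pin every PG whose up set changed back to its old up set.
    new_up = pg_up_map()
    for pgid, up in old_up.items():
        if new_up.get(pgid) != up:
            ceph('osd', 'pg-upmap', pgid, *[str(osd) for osd in up])
            print('set %s pg_upmap mapping to %s' % (pgid, up))

    # 4. Let recovery resume; nothing should actually move.
    for flag in ('norebalance', 'norecover', 'nobackfill'):
        ceph('osd', 'unset', flag)

if __name__ == '__main__':
    main()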