Hi all,

Back at the April CDM [1] we proposed a sort of "osd crush freeze" tool
that would use upmap to let operators more easily phase in new tunables
or make large crush tree changes gradually.

Theo here at CERN has prototyped this:

  https://github.com/cernceph/ceph-scripts/blob/master/tools/upmap-newup-to-oldup.py

It already seems quite usable, but we'd like some feedback on how best
to proceed.

(When this idea was first proposed, it was more or less agreed that it
would be better to implement mutable health warnings, but this was so
easy (and, I think, useful...) that we went ahead anyway.)

Here's how it works:

1. The cluster is HEALTH_OK and we want to change the tunables from
hammer to optimal.

2. Run the script: it sets the norebalance/norecover/nobackfill flags
and snapshots the current PG mappings, then prompts you to make the big
crush change:

# ./upmap-newup-to-oldup.py
Snapshotting the PG state...
norebalance is set
norecover is set
nobackfill is set
Do the change, wait for ceph status to stabilize, then yes/no to continue or exit (yes/no or y/n):

3. We make the big crush change, in this case `ceph osd crush tunables
optimal`. On our test cluster this stabilized at:

    health: HEALTH_WARN
            nobackfill,norebalance,norecover flag(s) set
            23736/1291065 objects misplaced (1.838%)
            Degraded data redundancy: 812037/1291065 objects degraded (62.897%), 2794 pgs degraded

4. We type 'yes' to continue, and the tool upmaps (via the ceph CLI) the
PGs back to where they currently are. Here's some sample output:

Do the change, wait for ceph status to stabilize, then yes/no to continue or exit (yes/no or y/n): y
...
set 51.11f pg_upmap mapping to [211,82,106]
set 50.197 pg_upmap mapping to [66,102,207]
set 51.201 pg_upmap mapping to [337,22,233]
...
set 52.2ac pg_upmap mapping to [120,300,193]
set 52.25f pg_upmap mapping to [216,142,294]
norebalance is unset
norecover is unset
nobackfill is unset
#

5. The tool finishes after a couple of minutes. By the time the script
exits we have `health: HEALTH_OK`. Note that no actual data moved.

6. Now, at our leisure, we slowly remove the pg_upmap entries to
gradually bring the new tunables into full effect.

Questions:

1. We think this would be useful, most notably when upgrading tunables
or adding or removing entire hosts or racks. Does anyone else think
this would be useful?

2. Is there anything we're missing that should prevent us from
developing this idea further?

3. Is there any interest in this being added to Ceph proper, e.g. as a
mgr module? (Being internal to Ceph, we could apply the upmaps much
more quickly rather than relying on the slow CLI.)

4. We weren't sure whether we should use pg-upmap or pg-upmap-items
here. Can someone advise us?

Cheers,

Theo & Dan

[1] https://pad.ceph.com/p/cephalocon-usability-brainstorming
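
P.S. For anyone who would rather read code than prose, the mechanism
boils down to roughly the sketch below (a minimal sketch, not the
actual script linked above): snapshot the up set of every PG, set the
flags, wait for the operator to make the crush change, then pin each PG
that would move back to its old up set via pg-upmap, and finally unset
the flags. The JSON layout and field names ('pgid', 'up') assumed for
`ceph pg dump pgs_brief` may vary between releases.

#!/usr/bin/env python3
#
# Minimal sketch of the "crush freeze" mechanism described above
# (not the actual upmap-newup-to-oldup.py).

import json
import subprocess

def ceph(*args):
    """Run a ceph CLI command and return its stdout."""
    return subprocess.check_output(['ceph'] + list(args))

def pg_up_map():
    """Return a dict of pgid -> up set from the current PG map."""
    pgs = json.loads(ceph('pg', 'dump', 'pgs_brief', '--format', 'json'))
    # Some releases nest the PG list under 'pg_stats'; handle both layouts.
    if isinstance(pgs, dict):
        pgs = pgs.get('pg_stats', [])
    return {pg['pgid']: pg['up'] for pg in pgs}

def main():
    # 1. Snapshot the current mappings and stop data movement.
    old_up = pg_up_map()
    for flag in ('norebalance', 'norecover', 'nobackfill'):
        ceph('osd', 'set', flag)

    # 2. The operator makes the big crush change out of band.
    input('Do the change, wait for ceph status to stabilize, '
          'then press enter to continue: ')

    # 3. Pin every PG whose up set changed back to its old up set.
    new_up = pg_up_map()
    for pgid, up in old_up.items():
        if new_up.get(pgid) != up:
            ceph('osd', 'pg-upmap', pgid, *[str(osd) for osd in up])
            print('set %s pg_upmap mapping to %s' % (pgid, up))

    # 4. Let recovery resume; nothing should actually move.
    for flag in ('norebalance', 'norecover', 'nobackfill'):
        ceph('osd', 'unset', flag)

if __name__ == '__main__':
    main()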