Re: prototyping an osd crush freeze tool

On Mon, Jun 4, 2018 at 6:55 AM, Dan van der Ster <dan@xxxxxxxxxxxxxx> wrote:
> Hi all,
>
> Back at the April CDM [1] we proposed a sort of "osd crush freeze"
> tool that would use upmap to let operators more easily phase-in new
> tunables or make large crush tree changes gradually.
>
> Theo here at CERN has prototyped this
>
>    https://github.com/cernceph/ceph-scripts/blob/master/tools/upmap-newup-to-oldup.py
>
> and it already seems quite usable, but we'd like some feedback on how
> to best proceed.
>
> (When this idea was first proposed, it was sort of agreed that it
> would be better to implement mutable health warnings, but this was so
> easy (and I think useful....) that we went ahead anyway...)
>
> Here's how it works:
>
> 1. Cluster is HEALTH_OK, we want to set tunables from hammer to optimal.
> 2. Run the script: it sets the norebalance/norecover/nobackfill flags
> and saves the current PG mappings (this step is sketched below, after
> the prompt), then prompts you to make a big crush change:
>
> # ./upmap-newup-to-oldup.py
> Snapshotting the PG state...
> norebalance is set
> norecover is set
> nobackfill is set
> Do the change, wait for ceph status to stabilize, then yes/no to
> continue or exit (yes/no or y/n):
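>
> (For reference, the snapshot/freeze step boils down to roughly the
> following. This is a minimal sketch with made-up helper names, not
> the actual script; it assumes the ceph CLI is on PATH, and the JSON
> layout of `pg dump` differs slightly between releases.)
>
>     # Hypothetical sketch, not the real upmap-newup-to-oldup.py code.
>     import json
>     import subprocess
>
>     def ceph(*args):
>         # Run a ceph CLI command and return its stdout as text.
>         return subprocess.check_output(('ceph',) + args).decode()
>
>     def snapshot_up_sets():
>         # Record the current 'up' set of every PG before the crush change.
>         out = json.loads(ceph('pg', 'dump', 'pgs_brief', '-f', 'json'))
>         pgs = out['pg_stats'] if isinstance(out, dict) else out
>         return {pg['pgid']: pg['up'] for pg in pgs}
>
>     def freeze():
>         # Pause all data movement while the big crush change is made.
>         for flag in ('norebalance', 'norecover', 'nobackfill'):
>             ceph('osd', 'set', flag)
>
>     old_up = snapshot_up_sets()
>     freeze()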
>
> 3. We do the big crush change, in this case `ceph osd crush tunables
> optimal`. On our test cluster this stabilized at:
>
>     health: HEALTH_WARN
>             nobackfill,norebalance,norecover flag(s) set
>             23736/1291065 objects misplaced (1.838%)
>             Degraded data redundancy: 812037/1291065 objects degraded
> (62.897%), 2794 pgs degraded
>
> 4. We type 'yes' to continue, and the tool upmaps (via the ceph CLI)
> each PG back to where it is currently placed (a sketch of this step
> follows the sample output below).
> Here's some sample output:
>
> Do the change, wait for ceph status to stabilize, then yes/no to
> continue or exit (yes/no or y/n): y
> ...
> set 51.11f pg_upmap mapping to [211,82,106]
> set 50.197 pg_upmap mapping to [66,102,207]
> set 51.201 pg_upmap mapping to [337,22,233]
> ...
> set 52.2ac pg_upmap mapping to [120,300,193]
> set 52.25f pg_upmap mapping to [216,142,294]
> norebalance is unset
> norecover is unset
> nobackfill is unset
> #
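>
> (Conceptually, the upmap step is just a diff of the saved mappings
> against the new ones, roughly like the sketch below, which continues
> the hypothetical one above. It uses the raw pg-upmap form to match
> the output shown; see also question 4 below.)
>
>     def current_up_sets():
>         # Re-read the 'up' sets after the crush change has settled.
>         out = json.loads(ceph('pg', 'dump', 'pgs_brief', '-f', 'json'))
>         pgs = out['pg_stats'] if isinstance(out, dict) else out
>         return {pg['pgid']: pg['up'] for pg in pgs}
>
>     def pin_back(old_up):
>         # Pin every PG whose 'up' set changed back to the saved set.
>         new_up = current_up_sets()
>         for pgid, up in sorted(old_up.items()):
>             if new_up.get(pgid) != up:
>                 ceph('osd', 'pg-upmap', pgid, *[str(o) for o in up])
>                 print('set %s pg_upmap mapping to %s' % (pgid, up))
>         # Nothing should actually move, since the upmaps reproduce the
>         # pre-change placement, so it is safe to unfreeze.
>         for flag in ('norebalance', 'norecover', 'nobackfill'):
>             ceph('osd', 'unset', flag)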
>
> 5. The tool finishes after a couple of minutes. By the time the script
> exits we have `health: HEALTH_OK`.
> Note that no actual data moved.
>
> 6. Now, at our leisure, we slowly remove the pg_upmap entries to
> gradually bring the new tunables into full effect (sketched below).
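>
> (As an illustration of that drain step, a hypothetical helper like
> the one below would do; it assumes `ceph osd dump -f json` lists the
> raw upmaps under a 'pg_upmap' key, as recent releases do. A real tool
> would poll the cluster health between batches rather than sleeping.)
>
>     import time
>
>     def drain_upmaps(batch_size=10, pause=600):
>         # Hypothetical helper: remove pg_upmap entries a few at a time
>         # so the resulting backfill stays small and controllable.
>         osdmap = json.loads(ceph('osd', 'dump', '-f', 'json'))
>         pgids = [e['pgid'] for e in osdmap.get('pg_upmap', [])]
>         for i in range(0, len(pgids), batch_size):
>             for pgid in pgids[i:i + batch_size]:
>                 ceph('osd', 'rm-pg-upmap', pgid)
>             time.sleep(pause)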
>
> Questions:
> 1. We think this would be useful, most notably for upgrading tunables
> or for adding/removing entire hosts or racks in a cluster. Does anyone
> else think this would be useful?
> 2. Is there anything we're missing that should prevent us from further
> developing this idea?

My concern here, and it's probably easy for you to test against, is
just that this is a big stress on the upmap mechanism. Obviously we
want it to be able to handle this, but you are basically inserting a
mapping for every PG in the cluster. Is upmap implemented carefully
and efficiently enough that this is okay? Are we sure it doesn't
inflate the size of the osdmap so far as to cause problems with
processing (or disk throughput) like we saw in the Dumpling era with
pg_temp?

You are probably in a better position to test this empirically than
most people, so I'd go ahead and do that with some deliberate extra
stress on the cluster, if you've got the time.
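
One cheap data point would be to compare the encoded osdmap size
before and after the mass upmap, e.g. with something like the sketch
below (stock CLI only; the helper name is made up):

    import os
    import subprocess
    import tempfile

    def osdmap_size_bytes():
        # Fetch the current full osdmap into a temp file and return its
        # encoded size; compare the value before and after the upmaps.
        fd, path = tempfile.mkstemp()
        os.close(fd)
        try:
            subprocess.check_call(['ceph', 'osd', 'getmap', '-o', path])
            return os.path.getsize(path)
        finally:
            os.unlink(path)
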
-Greg

> 3. Any interest in this being added to Ceph proper, e.g. as an mgr
> module? (Being internal to Ceph, it could apply the upmaps much more
> quickly rather than relying on the slow CLI.)
> 4. We weren't sure if we should use pg-upmap or pg-upmap-items here...
> Can someone advise us?
>
> Cheers,
>
> Theo & Dan
>
> [1] https://pad.ceph.com/p/cephalocon-usability-brainstorming