Re: prototyping an osd crush freeze tool

(re-adding list)

On Mon, Jun 4, 2018 at 4:41 PM Anthony D'Atri <aad@xxxxxxxxxxxxxx> wrote:
>
> Thoughts:
>
> o Love the idea of being able to update to optimal tunables, consolidate on straw2 buckets, etc.  In the past had to live with legacies for fear of DoSing the cluster.

Exactly :) It's a "have no fear" tool...

> o All clients must be on Luminous, right?  Would be really nice to add a step to the beginning to ensure this before proceeding.

Correct. We can indeed add this check.
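
Something along these lines, perhaps. This is only a sketch, not code
from the script: it leans on the fact that upmap itself already needs
"ceph osd set-require-min-compat-client luminous", and the osd dump
field name is from memory, so treat it as illustrative.

    import json
    import subprocess
    import sys

    LUMINOUS_OR_NEWER = ('luminous', 'mimic')  # extend as releases appear

    def require_min_compat_client():
        # "ceph osd dump -f json" carries the require_min_compat_client
        # setting; the exact key name may vary between releases.
        dump = json.loads(subprocess.check_output(
            ['ceph', 'osd', 'dump', '--format', 'json']))
        return dump.get('require_min_compat_client', '')

    if require_min_compat_client() not in LUMINOUS_OR_NEWER:
        sys.exit('pre-luminous clients may still connect; run '
                 '"ceph osd set-require-min-compat-client luminous" first')

A stricter variant could also parse "ceph features" and refuse to run
if any connected client still reports a pre-luminous release.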

> o For adding racks/hosts the gentle-reweight script works well for us, though there’s a certain appeal to using one tool for all changes.  To which end an eventual merge of the two scripts seems natural: removing N upmaps at a time, then waiting for time, backfilling cessation, etc. before removing the next set.
>

That's precisely what we had in mind.
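
Roughly the loop we're imagining, as a sketch: drop a small batch of
upmaps, wait for backfill to quiesce, repeat. The batch size and the
wait condition below are placeholders, and whether to call rm-pg-upmap
or rm-pg-upmap-items ties into question 4 in the quoted mail below.

    import json
    import subprocess
    import time

    BATCH = 10  # upmaps to drop per round (placeholder)

    def misplaced_objects():
        # "ceph status -f json" only reports misplaced_objects while
        # something is actually misplaced, hence the default of 0.
        s = json.loads(subprocess.check_output(
            ['ceph', 'status', '--format', 'json']))
        return s['pgmap'].get('misplaced_objects', 0)

    def upmapped_pgs():
        osdmap = json.loads(subprocess.check_output(
            ['ceph', 'osd', 'dump', '--format', 'json']))
        return [e['pgid'] for e in osdmap.get('pg_upmap', [])]

    pgs = upmapped_pgs()
    while pgs:
        batch, pgs = pgs[:BATCH], pgs[BATCH:]
        for pgid in batch:
            subprocess.check_call(['ceph', 'osd', 'rm-pg-upmap', pgid])
        # crude "backfilling cessation" test: wait until nothing is misplaced
        while misplaced_objects() > 0:
            time.sleep(30)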

Thanks for the feedback!

-- Dan




> >
> > Hi all,
> >
> > Back at the April CDM [1] we proposed a sort of "osd crush freeze"
> > tool that would use upmap to let operators more easily phase in new
> > tunables or make large crush tree changes gradually.
> >
> > Theo here at CERN has prototyped this
> >
> >   https://github.com/cernceph/ceph-scripts/blob/master/tools/upmap-newup-to-oldup.py
> >
> > and it already seems quite usable, but we'd like some feedback on how
> > best to proceed.
> >
> > (When this idea was first proposed, the rough consensus was that
> > mutable health warnings would be the better approach, but this was so
> > easy, and I think useful, that we went ahead anyway...)
> >
> > Here's how it works:
> >
> > 1. Cluster is HEALTH_OK, we want to set tunables from hammer to optimal.
> > 2. Run the script: it sets the norebalance/norecover/nobackfill flags
> > and saves the current PG mappings, then prompts you to make a big
> > crush change:
> >
> > # ./upmap-newup-to-oldup.py
> > Snapshotting the PG state...
> > norebalance is set
> > norecover is set
> > nobackfill is set
> > Do the change, wait for ceph status to stabilize, then yes/no to
> > continue or exit (yes/no or y/n):
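
For reference, the snapshot half of the script boils down to roughly
the following. This is a simplified sketch rather than the script
itself; the JSON layout of "ceph pg dump" varies between releases, so
the field lookup is illustrative.

    import json
    import subprocess

    FLAGS = ('norebalance', 'norecover', 'nobackfill')

    def set_flags():
        for flag in FLAGS:
            subprocess.check_call(['ceph', 'osd', 'set', flag])
            print('%s is set' % flag)

    def snapshot_up_sets():
        dump = json.loads(subprocess.check_output(
            ['ceph', 'pg', 'dump', '--format', 'json']))
        # pg_stats sits at the top level on luminous; newer releases
        # nest it under pg_map
        stats = dump.get('pg_stats') or \
            dump.get('pg_map', {}).get('pg_stats', [])
        return {pg['pgid']: pg['up'] for pg in stats}

    set_flags()
    old_up = snapshot_up_sets()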
> >
> > 3. We do the big crush change, in this case `ceph osd crush tunables
> > optimal`. On our test cluster this stabilized at:
> >
> >    health: HEALTH_WARN
> >            nobackfill,norebalance,norecover flag(s) set
> >            23736/1291065 objects misplaced (1.838%)
> >            Degraded data redundancy: 812037/1291065 objects degraded
> > (62.897%), 2794 pgs degraded
> >
> > 4. We type 'yes' to continue, and the tool upmaps (via the ceph cli)
> > the PGs back to where they are presently.
> > Here's some sample output:
> >
> > Do the change, wait for ceph status to stabilize, then yes/no to
> > continue or exit (yes/no or y/n): y
> > ...
> > set 51.11f pg_upmap mapping to [211,82,106]
> > set 50.197 pg_upmap mapping to [66,102,207]
> > set 51.201 pg_upmap mapping to [337,22,233]
> > ...
> > set 52.2ac pg_upmap mapping to [120,300,193]
> > set 52.25f pg_upmap mapping to [216,142,294]
> > norebalance is unset
> > norecover is unset
> > nobackfill is unset
> > #
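
And the restore half, again as a simplified sketch: re-dump the PGs
after the crush change, and for every PG whose up set moved, pin the
old mapping back with pg-upmap before clearing the flags (old_up being
the dict captured in the snapshot sketch above).

    import json
    import subprocess

    def current_up_sets():
        dump = json.loads(subprocess.check_output(
            ['ceph', 'pg', 'dump', '--format', 'json']))
        stats = dump.get('pg_stats') or \
            dump.get('pg_map', {}).get('pg_stats', [])
        return {pg['pgid']: pg['up'] for pg in stats}

    def freeze_to(old_up):
        new_up = current_up_sets()
        for pgid, osds in sorted(old_up.items()):
            if new_up.get(pgid) != osds:
                # pin the PG back onto its pre-change up set
                subprocess.check_call(['ceph', 'osd', 'pg-upmap', pgid] +
                                      [str(o) for o in osds])
                print('set %s pg_upmap mapping to %s' % (pgid, osds))
        for flag in ('norebalance', 'norecover', 'nobackfill'):
            subprocess.check_call(['ceph', 'osd', 'unset', flag])
            print('%s is unset' % flag)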
> >
> > 5. The tool finishes after a couple of minutes. By the time the script
> > exits we have `health: HEALTH_OK`.
> > Note that no actual data moved.
> >
> > 6. Now, at our leisure, we slowly remove the pg_upmap entries to
> > gradually bring the new tunables into full effect.
> >
> > Questions:
> > 1. We think this would be useful, most notably for upgrading tunables
> > or adding or removing entire hosts or racks in a cluster. Does anyone
> > else think this would be useful?
> > 2. Is there anything we're missing that should prevent us from further
> > developing this idea?
> > 3. Any interest in this being added to ceph proper, e.g. as an mgr
> > module? (Running inside Ceph, it could apply the upmaps much more
> > quickly, rather than relying on the slow cli.)
> > 4. We weren't sure if we should use pg-upmap or pg-upmap-items here...
> > Can someone advise us?
> >
> > Cheers,
> >
> > Theo & Dan
> >
> > [1] https://pad.ceph.com/p/cephalocon-usability-brainstorming



