Re: Problems with autoscaler (overlapping roots) after changing the pool class

I then followed Eugen's advice, which was something like the following (a rough
command sketch is below the list):

   - create a new rule via CLI which includes a device class
   - dump the crushmap again and test the new rule with crushtool
   - If the output is as expected, assign the new rule to a pool of your
   choice; I'd start with a less important one.
   - If everything's good, do the same for all necessary pools and wait for
   remapping to finish.
   - No pool should be using the default "replicated_rule" now.
   - Dump a fresh crushmap and decompile it
      - add a "class hdd" entry to the default replicated_rule
      - save and compile
   - inject the modified crushmap (with this single change); nothing should
   happen in the cluster, since no pool uses the replicated_rule at that
   point.

...
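
For reference, the rough shape of those steps as commands (rule, pool and
file names are just placeholders, adjust to your cluster; for EC pools the
new rule comes from an erasure-code-profile with crush-device-class, as in
Massimo's mail further down):

# new replicated rule restricted to the hdd device class
ceph osd crush rule create-replicated replicated_hdd default host hdd

# dump the crushmap and sanity-check the new rule
ceph osd getcrushmap -o crushmap.bin
crushtool -i crushmap.bin --test --rule <new_rule_id> --num-rep 3 --show-mappings | head

# move a less important pool first, then the remaining pools
ceph osd pool set <pool> crush_rule replicated_hdd

# once no pool uses replicated_rule anymore: decompile, add the class to its
# "take" step, recompile and inject
crushtool -d crushmap.bin -o crushmap.txt
#   in rule replicated_rule: "step take default" -> "step take default class hdd"
crushtool -c crushmap.txt -o crushmap.new
ceph osd setcrushmap -i crushmap.new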

After I changed the pools to the new rule, remapping started, the warning
messages no longer appear, and the PG counts for the pools are increasing.

Thank you all for the help.

Rok

On Tue, Dec 24, 2024 at 2:16 AM Anthony D'Atri <anthony.datri@xxxxxxxxx>
wrote:

> If your NVMe OSDs have the `ssd` device class, doing what you suggest
> might not even result in any data movement.
>
> docs.ceph.com
> <https://docs.ceph.com/en/reef/rados/operations/crush-map-edits/#migrating-from-a-legacy-ssd-rule-to-device-classes>
>
>
> This page shows how to use the reclassify feature to help avoid typos when
> editing the CRUSHmap.  Using a CLI tool when feasible makes this sort of
> thing a lot safer, compared to back in the day when we had to text-edit
> everything by hand :nailbiting:.  One can readily diff the before and after
> decompiled text CRUSHmaps to ensure sanity before recompiling and injecting.
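>
> A rough sketch of that docs workflow, with class/root names as placeholders
> (the exact --reclassify-* arguments depend on your particular map, see the
> linked page):
>
> ceph osd getcrushmap -o original.bin
> crushtool -i original.bin --reclassify \
>     --set-subtree-class default hdd \
>     --reclassify-root default hdd \
>     -o adjusted.bin
> # few or no changed mappings expected:
> crushtool -i original.bin --compare adjusted.bin
> # and/or diff the decompiled text versions:
> crushtool -d original.bin -o original.txt
> crushtool -d adjusted.bin -o adjusted.txt
> diff -u original.txt adjusted.txt
> ceph osd setcrushmap -i adjusted.bin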
>
> I’ve done this myself multiple times since device classes became a thing.
>
>
>
> On Dec 23, 2024, at 5:05 PM, Rok Jaklič <rjaklic@xxxxxxxxx> wrote:
>
> I will try changing/adding class ssd to the replicated_rule tomorrow, even
> though for some reason I am a little hesitant to edit this rule, since it
> could mean that the system data for rgw would "stay somewhere" if something
> goes wrong. I was much braver when changing the rule for EC32, where I
> restricted the OSD data to hdd only, since "some data" was already on hdd.
>
>
> On Mon, Dec 23, 2024 at 4:12 PM Anthony D'Atri <anthony.datri@xxxxxxxxx>
> wrote:
>
> Agreed.  The .mgr pool is a usual suspect here, especially when using
> Rook.  When any pool is constrained to a device class, this kind of warning
> will appear unless *all* pools specify one.
>
> Of course there’s also the strategy of disabling the autoscaler, but that
> takes more analysis.  We old farts are used to it, but it can be daunting
> for whippersnappers.
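>
> (For reference, disabling it would be something like:
>
> ceph osd pool set <pool> pg_autoscale_mode off
> ceph config set global osd_pool_default_pg_autoscale_mode off
>
> per existing pool and as the default for new pools, respectively, with
> <pool> as a placeholder.)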
>
> On Dec 23, 2024, at 9:11 AM, Eugen Block <eblock@xxxxxx> wrote:
>
> Don't try to delete a root; that will definitely break something.
>
> Instead, check which crush rules don't use a device class and use the
> reclassify feature of crushtool to modify them. This will still trigger a
> bit of data movement, but not as much as simply changing the rule would.
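>
> A quick way to spot rules without a device class (their "take" step
> references a plain root like "default" instead of a shadow root like
> "default~hdd"):
>
> ceph osd crush rule dump | grep -E '"rule_name"|"item_name"'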
>
>
> Quoting Rok Jaklič <rjaklic@xxxxxxxxx>:
>
> I got a similar problem after changing a pool's class to use only hdd,
> following https://www.spinics.net/lists/ceph-users/msg84987.html. The data
> migrated successfully.
>
> I get warnings like:
>
> 2024-12-23T14:39:37.103+0100 7f949edad640  0 [pg_autoscaler WARNING root]
> pool default.rgw.buckets.index won't scale due to overlapping roots: {-1, -18}
> 2024-12-23T14:39:37.105+0100 7f949edad640  0 [pg_autoscaler WARNING root]
> pool default.rgw.buckets.data won't scale due to overlapping roots: {-2, -1, -18}
> 2024-12-23T14:39:37.107+0100 7f949edad640  0 [pg_autoscaler WARNING root]
> pool cephfs_metadata won't scale due to overlapping roots: {-2, -1, -18}
> 2024-12-23T14:39:37.111+0100 7f949edad640  0 [pg_autoscaler WARNING root]
> pool 1 contains an overlapping root -1... skipping scaling
> ...
>
> while the crush tree with shadow buckets shows:
> -2    hdd  1043.93188  root default~hdd
> -4    hdd   151.82336      host ctplosd1~hdd
> 0    hdd     5.45798          osd.0
> 1    hdd     5.45798          osd.1
> 2    hdd     5.45798          osd.2
> 3    hdd     5.45798          osd.3
> 4    hdd     5.45798          osd.4
> ...
> -1         1050.48230  root default
> -3          153.27872      host ctplosd1
> 0    hdd     5.45798          osd.0
> 1    hdd     5.45798          osd.1
> 2    hdd     5.45798          osd.2
> 3    hdd     5.45798          osd.3
> 4    hdd     5.45798          osd.4
> ...
>
> and even though the crush rule for a pool such as
>
> pool 9 'default.rgw.buckets.data' erasure profile ec-32-profile size 5
> min_size 4 crush_rule 1 object_hash rjenkins pg_num 512 pgp_num 512
> autoscale_mode on last_change 320144 lfor 0/127784/214408 flags
> hashpspool,ec_overwrites stripe_width 12288 application rgw
>
> is set to:
>       {
>           "rule_id": 1,
>           "rule_name": "ec32",
>           "type": 3,
>           "steps": [
>               {
>                   "op": "set_chooseleaf_tries",
>                   "num": 5
>               },
>               {
>                   "op": "set_choose_tries",
>                   "num": 100
>               },
>               {
>                   "op": "take",
>                   "item": -2,
>                   "item_name": "default~hdd"
>               },
>               {
>                   "op": "chooseleaf_indep",
>                   "num": 0,
>                   "type": "host"
>               },
>               {
>                   "op": "emit"
>               }
>           ]
>       },
>
> and I still get warning messages.
>
> Is there a way I can check whether a particular "root" is used somewhere,
> other than going through "ceph osd pool ls detail" and looking into each
> crush rule?
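>
> (The cross-check I do by hand at the moment is roughly:
>
> for p in $(ceph osd pool ls); do
>   echo -n "$p -> "; ceph osd pool get "$p" crush_rule
> done
> # then, per rule, look at which root its "take" step uses:
> ceph osd crush rule dump <rule_name> | grep item_name
>
> which gets tedious, hence the question.)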
>
> Can I somehow delete "old" root default?
>
> Would it be safe to change pg_num manually even with overlapping roots?
>
> Rok
>
>
> On Wed, Jan 25, 2023 at 12:03 PM Massimo Sgaravatto <
> massimo.sgaravatto@xxxxxxxxx> wrote:
>
> I tried the following on a small testbed first:
>
> ceph osd erasure-code-profile set profile-4-2-hdd k=4 m=2
> crush-failure-domain=host crush-device-class=hdd
> ceph osd crush rule create-erasure ecrule-4-2-hdd profile-4-2-hdd
> ceph osd pool set ecpool-4-2 crush_rule ecrule-4-2-hdd
>
> and indeed, after applying this change to all the EC pools, the
> autoscaler doesn't complain anymore.
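>
> (Quick check: "ceph osd pool autoscale-status" reports all the pools again,
> and the "overlapping roots" messages are gone from the mgr log.)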
>
> Thanks a lot !
>
> Cheers, Massimo
>
> On Tue, Jan 24, 2023 at 7:02 PM Eugen Block <eblock@xxxxxx> wrote:
>
> Hi,
>
> what you can't change with EC pools is the EC profile; the pool's
> ruleset you can change. The fix is the same as for the replicated
> pools: assign a ruleset with the hdd class, and after some data movement
> the autoscaler should not complain anymore.
>
> Regards
> Eugen
>
> Quoting Massimo Sgaravatto <massimo.sgaravatto@xxxxxxxxx>:
>
> Dear all
>
> I have just changed the crush rule for all the replicated pools in the
> following way:
>
> ceph osd crush rule create-replicated replicated_hdd default host hdd
> ceph osd pool set <poolname> crush_rule replicated_hdd
>
> See also this [*] thread.
> Before applying this change, these pools were all using
> the replicated_ruleset rule, where the class is not specified.
>
>
>
> I am noticing now a problem with the autoscaler: "ceph osd pool
> autoscale-status" doesn't report any output, and the mgr log complains
> about overlapping roots:
>
> [pg_autoscaler ERROR root] pool xyz has overlapping roots: {-18, -1}
>
>
>
> Indeed:
>
> # ceph osd crush tree --show-shadow
> ID   CLASS  WEIGHT      TYPE NAME
> -18    hdd  1329.26501  root default~hdd
> -17    hdd   329.14154      rack Rack11-PianoAlto~hdd
> -15    hdd    54.56085          host ceph-osd-04~hdd
> 30    hdd     5.45609              osd.30
> 31    hdd     5.45609              osd.31
> ...
> ...
> -1         1329.26501  root default
> -7          329.14154      rack Rack11-PianoAlto
> -8           54.56085          host ceph-osd-04
> 30    hdd     5.45609              osd.30
> 31    hdd     5.45609              osd.31
> ...
>
> I have already read about this behavior, but I have no clear idea how to
> fix the problem.
>
> I read somewhere that the problem happens when there are rules that force
> some pools to only use one class, while there are also pools which do not
> make any distinction between device classes.
>
>
> All the replicated pools are using the replicated_hdd rule, but I also
> have some EC pools which are using a profile where the class is not
> specified.
>
> As far as I understand, I can't force these pools to use only the hdd
> class: according to the docs, I can't change this profile to specify the
> hdd class (or at least the change wouldn't be applied to the existing EC
> pools).
>
>
> Any suggestions?
>
> The crush map is available at https://cernbox.cern.ch/s/gIyjbQbmoTFHCrr,
> if you want to have a look.
>
> Many thanks, Massimo
>
> [*] https://www.mail-archive.com/ceph-users@xxxxxxx/msg18534.html
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx



