Re: Problems with autoscaler (overlapping roots) after changing the pool class

That would then be replicated_rule:

[root@ctplmon1 ~]# ceph osd pool ls detail
pool 1 '.mgr' replicated size 3 min_size 2 crush_rule 0 object_hash
rjenkins pg_num 1 pgp_num 1 autoscale_mode on last_change 320144 flags
hashpspool stripe_width 0 pg_num_min 1 application mgr,mgr_devicehealth
pool 2 '.rgw.root' replicated size 3 min_size 2 crush_rule 0 object_hash
rjenkins pg_num 32 pgp_num 32 autoscale_mode on last_change 320144 lfor
0/18964/18962 flags hashpspool stripe_width 0 application rgw
pool 3 'default.rgw.log' replicated size 3 min_size 2 crush_rule 0
object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode on last_change
320144 lfor 0/127672/127670 flags hashpspool stripe_width 0 application rgw
pool 4 'default.rgw.control' replicated size 3 min_size 2 crush_rule 0
object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode on last_change
320144 lfor 0/59850/59848 flags hashpspool stripe_width 0 application rgw
pool 5 'default.rgw.meta' replicated size 3 min_size 2 crush_rule 0
object_hash rjenkins pg_num 8 pgp_num 8 autoscale_mode on last_change
320144 lfor 0/51538/51536 flags hashpspool stripe_width 0 pg_autoscale_bias
4 pg_num_min 8 application rgw
pool 6 'default.rgw.buckets.index' replicated size 3 min_size 2 crush_rule
2 object_hash rjenkins pg_num 8 pgp_num 8 autoscale_mode on last_change
315285 lfor 0/127830/127828 flags hashpspool stripe_width 0
pg_autoscale_bias 4 pg_num_min 8 application rgw
pool 7 'default.rgw.buckets.non-ec' replicated size 3 min_size 2 crush_rule
0 object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode on last_change
320144 lfor 0/76474/76472 flags hashpspool stripe_width 0 application rgw
pool 9 'default.rgw.buckets.data' erasure profile ec-32-profile size 5
min_size 4 crush_rule 1 object_hash rjenkins pg_num 512 pgp_num 512
autoscale_mode on last_change 320144 lfor 0/127784/214408 flags
hashpspool,ec_overwrites stripe_width 12288 application rgw
pool 10 'cephfs_data' replicated size 3 min_size 2 crush_rule 0 object_hash
rjenkins pg_num 128 pgp_num 128 autoscale_mode on last_change 320144 flags
hashpspool,bulk stripe_width 0 application cephfs
pool 11 'cephfs_metadata' replicated size 3 min_size 2 crush_rule 4
object_hash rjenkins pg_num 8 pgp_num 8 autoscale_mode on last_change
320144 flags hashpspool stripe_width 0 pg_autoscale_bias 4 pg_num_min 16
recovery_priority 5 application cephfs

[root@ctplmon1 ~]# ceph osd crush rule dump
[
    {
        "rule_id": 0,
        "rule_name": "replicated_rule",
        "type": 1,
        "steps": [
            {
                "op": "take",
                "item": -1,
                "item_name": "default"
            },
            {
                "op": "chooseleaf_firstn",
                "num": 0,
                "type": "host"
            },
            {
                "op": "emit"
            }
        ]
    },
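
As an aside, to list the rules whose "take" step still uses the bare
root (no ~class suffix), something like this should work; the jq filter
is my own guess, untested:

ceph osd crush rule dump | \
  jq -r '.[] | select(any(.steps[]?; .item_name == "default")) | .rule_name'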

I would then add "class ssd" after "step take default" in the crushmap,
just like in https://www.spinics.net/lists/ceph-users/msg84987.html:

# rules
rule replicated_rule {
        id 0
        type replicated
        step take default class ssd
        step chooseleaf firstn 0 type host
        step emit
}
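
If I go that route, I assume the usual edit round trip applies (an
untested sketch; the --compare step just previews how many mappings
would change):

ceph osd getcrushmap -o crushmap.bin            # export the current map
crushtool -d crushmap.bin -o crushmap.txt       # decompile to text
# ... edit crushmap.txt: add "class ssd" to "step take default" ...
crushtool -c crushmap.txt -o crushmap-new.bin   # recompile
crushtool -i crushmap.bin --compare crushmap-new.bin
ceph osd setcrushmap -i crushmap-new.bin        # inject the new map

Eugen's reclassify suggestion below would instead rewrite a legacy rule
in place with minimal data movement (hdd here, matching what the devices
already are, rather than the ssd change above):

crushtool -i crushmap.bin --reclassify \
    --reclassify-root default hdd -o crushmap-new.bin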

---

I expect that data will move from hdd to ssd.

Will services be unavailable during the data movement?

Or would it perhaps also be safe just to set the pools to another rule:
    {
        "rule_id": 2,
        "rule_name": "replicated_ssd",
        "type": 1,
        "steps": [
            {
                "op": "take",
                "item": -18,
                "item_name": "default~ssd"
            },
            {
                "op": "chooseleaf_firstn",
                "num": 0,
                "type": "host"
            },
            {
                "op": "emit"
            }
        ]
    },
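
A class-aware rule like that can be created with the same helper Massimo
used further down for hdd; in my case replicated_ssd already exists as
rule 2, so this is just for reference:

ceph osd crush rule create-replicated replicated_ssd default host ssd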

Or is another option simply to reset a pool's crush_rule, e.g.:

ceph osd pool set .mgr crush_rule replicated_ssd ?
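
If that works, I could presumably switch all the affected pools in one
go and then recheck the autoscaler; a rough sketch (the pool list here
is just an illustration, not a final selection):

for pool in .mgr .rgw.root default.rgw.log default.rgw.control; do
    ceph osd pool set "$pool" crush_rule replicated_ssd
done
# the autoscaler should stop skipping pools once roots no longer overlap
ceph osd pool autoscale-status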

Rok

On Mon, Dec 23, 2024 at 3:12 PM Eugen Block <eblock@xxxxxx> wrote:

> Don't try to delete a root; that will definitely break something.
> Instead, check the crush rules which don't use a device class and use
> the reclassify feature of crushtool to modify the rules. This will
> trigger only a bit of data movement, not as much as a simple change of
> the rule would.
>
> Quoting Rok Jaklič <rjaklic@xxxxxxxxx>:
>
> > I got a similar problem after changing a pool's class to use only hdd,
> > following https://www.spinics.net/lists/ceph-users/msg84987.html. The
> > data migrated successfully.
> >
> > I get warnings like:
> > 2024-12-23T14:39:37.103+0100 7f949edad640  0 [pg_autoscaler WARNING root]
> > pool default.rgw.buckets.index won't scale due to overlapping roots: {-1,
> > -18}
> > 2024-12-23T14:39:37.105+0100 7f949edad640  0 [pg_autoscaler WARNING root]
> > pool default.rgw.buckets.data won't scale due to overlapping roots: {-2,
> > -1, -18}
> > 2024-12-23T14:39:37.107+0100 7f949edad640  0 [pg_autoscaler WARNING root]
> > pool cephfs_metadata won't scale due to overlapping roots: {-2, -1, -18}
> > 2024-12-23T14:39:37.111+0100 7f949edad640  0 [pg_autoscaler WARNING root]
> > pool 1 contains an overlapping root -1... skipping scaling
> > ...
> >
> > while the crush tree with shadow buckets shows:
> >  -2    hdd  1043.93188  root default~hdd
> >  -4    hdd   151.82336      host ctplosd1~hdd
> >   0    hdd     5.45798          osd.0
> >   1    hdd     5.45798          osd.1
> >   2    hdd     5.45798          osd.2
> >   3    hdd     5.45798          osd.3
> >   4    hdd     5.45798          osd.4
> > ...
> >  -1         1050.48230  root default
> >  -3          153.27872      host ctplosd1
> >   0    hdd     5.45798          osd.0
> >   1    hdd     5.45798          osd.1
> >   2    hdd     5.45798          osd.2
> >   3    hdd     5.45798          osd.3
> >   4    hdd     5.45798          osd.4
> > ...
> >
> > and even though the crush rule, for example for
> >
> > pool 9 'default.rgw.buckets.data' erasure profile ec-32-profile size 5
> > min_size 4 crush_rule 1 object_hash rjenkins pg_num 512 pgp_num 512
> > autoscale_mode on last_change 320144 lfor 0/127784/214408 flags
> > hashpspool,ec_overwrites stripe_width 12288 application rgw
> >
> > is set to:
> >         {
> >             "rule_id": 1,
> >             "rule_name": "ec32",
> >             "type": 3,
> >             "steps": [
> >                 {
> >                     "op": "set_chooseleaf_tries",
> >                     "num": 5
> >                 },
> >                 {
> >                     "op": "set_choose_tries",
> >                     "num": 100
> >                 },
> >                 {
> >                     "op": "take",
> >                     "item": -2,
> >                     "item_name": "default~hdd"
> >                 },
> >                 {
> >                     "op": "chooseleaf_indep",
> >                     "num": 0,
> >                     "type": "host"
> >                 },
> >                 {
> >                     "op": "emit"
> >                 }
> >             ]
> >         },
> >
> > I still get the warning messages.
> >
> > Is there a way I can check if a particular "root" is used somewhere other
> > than going through ceph osd pool ls detail and looking into the crush rules?
> >
> > Can I somehow delete "old" root default?
> >
> > Would it be safe to change pg_num manually even with overlapping roots?
> >
> > Rok
> >
> >
> > On Wed, Jan 25, 2023 at 12:03 PM Massimo Sgaravatto <
> > massimo.sgaravatto@xxxxxxxxx> wrote:
> >
> >> I tried the following on a small testbed first:
> >>
> >> ceph osd erasure-code-profile set profile-4-2-hdd k=4 m=2
> >> crush-failure-domain=host crush-device-class=hdd
> >> ceph osd crush rule create-erasure ecrule-4-2-hdd profile-4-2-hdd
> >> ceph osd pool set ecpool-4-2 crush_rule ecrule-4-2-hdd
> >>
> >> and indeed, after applying this change to all the EC pools, the
> >> autoscaler doesn't complain anymore.
> >>
> >> Thanks a lot!
> >>
> >> Cheers, Massimo
> >>
> >> On Tue, Jan 24, 2023 at 7:02 PM Eugen Block <eblock@xxxxxx> wrote:
> >>
> >> > Hi,
> >> >
> >> > what you can't change with EC pools is the EC profile; the pool's
> >> > ruleset you can change. The fix is the same as for the replicated
> >> > pools: assign a ruleset with the hdd class, and after some data
> >> > movement the autoscaler should not complain anymore.
> >> >
> >> > Regards
> >> > Eugen
> >> >
> >> > Quoting Massimo Sgaravatto <massimo.sgaravatto@xxxxxxxxx>:
> >> >
> >> > > Dear all
> >> > >
> >> > > I have just changed the crush rule for all the replicated pools in
> >> > > the following way:
> >> > >
> >> > > ceph osd crush rule create-replicated replicated_hdd default host hdd
> >> > > ceph osd pool set <poolname> crush_rule replicated_hdd
> >> > >
> >> > > See also this [*] thread. Before applying this change, these pools
> >> > > were all using the replicated_ruleset rule, where the class is not
> >> > > specified.
> >> > >
> >> > >
> >> > >
> >> > > I am now noticing a problem with the autoscaler: "ceph osd pool
> >> > > autoscale-status" doesn't report any output, and the mgr log
> >> > > complains about overlapping roots:
> >> > >
> >> > > [pg_autoscaler ERROR root] pool xyz has overlapping roots: {-18, -1}
> >> > >
> >> > >
> >> > > Indeed:
> >> > >
> >> > > # ceph osd crush tree --show-shadow
> >> > > ID   CLASS  WEIGHT      TYPE NAME
> >> > > -18    hdd  1329.26501  root default~hdd
> >> > > -17    hdd   329.14154      rack Rack11-PianoAlto~hdd
> >> > > -15    hdd    54.56085          host ceph-osd-04~hdd
> >> > >  30    hdd     5.45609              osd.30
> >> > >  31    hdd     5.45609              osd.31
> >> > > ...
> >> > > ...
> >> > >  -1         1329.26501  root default
> >> > >  -7          329.14154      rack Rack11-PianoAlto
> >> > >  -8           54.56085          host ceph-osd-04
> >> > >  30    hdd     5.45609              osd.30
> >> > >  31    hdd     5.45609              osd.31
> >> > > ...
> >> > >
> >> > > I have already read about this behavior, but I have no clear idea
> >> > > of how to fix the problem.
> >> > >
> >> > > I read somewhere that the problem happens when there are rules that
> >> > > force some pools to use only one class, while there are also pools
> >> > > which make no distinction between device classes.
> >> > >
> >> > >
> >> > > All the replicated pools are using the replicated_hdd rule, but I
> >> > > also have some EC pools which are using a profile where the class
> >> > > is not specified. As far as I understand, I can't force these pools
> >> > > to use only the hdd class: according to the docs, I can't change
> >> > > this profile to specify the hdd class (or at least the change
> >> > > wouldn't be applied to the existing EC pools).
> >> > >
> >> > > Any suggestions?
> >> > >
> >> > > The crush map is available at
> >> > > https://cernbox.cern.ch/s/gIyjbQbmoTFHCrr, if you want to have a
> >> > > look.
> >> > >
> >> > > Many thanks, Massimo
> >> > >
> >> > > [*] https://www.mail-archive.com/ceph-users@xxxxxxx/msg18534.html
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx



