Hi community,

Please help me understand what is going on. I have a Ceph (Reef) test cluster with the following CRUSH map:

ceph osd crush tree
ID   CLASS  WEIGHT    TYPE NAME
 -1         12.00000  root default
 -7          3.00000      host ksr-ceph-osd1
  0    hdd   1.00000          osd.0
  6    hdd   1.00000          osd.6
 10    hdd   1.00000          osd.10
 -9          3.00000      host ksr-ceph-osd2
  3    hdd   1.00000          osd.3
  7    hdd   1.00000          osd.7
 11    hdd   1.00000          osd.11
 -5          3.00000      host ksr-ceph-osd3
  2    hdd   1.00000          osd.2
  5    hdd   1.00000          osd.5
  9    hdd   1.00000          osd.9
 -3          3.00000      host ksr-ceph-osd4
  1    hdd   1.00000          osd.1
  4    hdd   1.00000          osd.4
  8    hdd   1.00000          osd.8
-11          0            rack rack1
-13          0            rack rack2
-15          0            rack rack3
-17          0            rack rack4

Ceph status is like:

  cluster:
    id:     8a174287-42f8-43b6-9973-f174110b508b
    health: HEALTH_OK

  services:
    mon: 5 daemons, quorum ksr-ceph-mon2,ksr-ceph-mon3,ksr-ceph-mon1,ksr-ceph-mon5,ksr-ceph-mon4 (age 3h)
    mgr: ksr-ceph-mon1(active, since 3h), standbys: ksr-ceph-mon2, ksr-ceph-mon3
    mds: 2/2 daemons up, 3 standby
    osd: 12 osds: 12 up (since 3h), 12 in (since 6d)

  data:
    volumes: 2/2 healthy
    pools:   5 pools, 129 pgs
    objects: 578 objects, 154 MiB
    usage:   1.0 GiB used, 599 GiB / 600 GiB avail
    pgs:     129 active+clean

I then run:

ceph osd set norecover
ceph osd set nobackfill
ceph osd set norebalance
ceph osd crush move ksr-ceph-osd1 rack=rack1
ceph osd crush move ksr-ceph-osd2 rack=rack2
ceph osd crush move ksr-ceph-osd3 rack=rack3
ceph osd crush move ksr-ceph-osd4 rack=rack4

resulting in the following CRUSH tree:

ceph osd crush tree
ID   CLASS  WEIGHT    TYPE NAME
 -1         12.00000  root default
-11          3.00000      rack rack1
 -7          3.00000          host ksr-ceph-osd1
  0    hdd   1.00000              osd.0
  6    hdd   1.00000              osd.6
 10    hdd   1.00000              osd.10
-13          3.00000      rack rack2
 -9          3.00000          host ksr-ceph-osd2
  3    hdd   1.00000              osd.3
  7    hdd   1.00000              osd.7
 11    hdd   1.00000              osd.11
-15          3.00000      rack rack3
 -5          3.00000          host ksr-ceph-osd3
  2    hdd   1.00000              osd.2
  5    hdd   1.00000              osd.5
  9    hdd   1.00000              osd.9
-17          3.00000      rack rack4
 -3          3.00000          host ksr-ceph-osd4
  1    hdd   1.00000              osd.1
  4    hdd   1.00000              osd.4
  8    hdd   1.00000              osd.8

And ceph status is now:

  cluster:
    id:
            8a174287-42f8-43b6-9973-f174110b508b
    health: HEALTH_WARN
            nobackfill,norebalance,norecover flag(s) set
            Degraded data redundancy: 2701/1734 objects degraded (155.767%), 55 pgs degraded

  services:
    mon: 5 daemons, quorum ksr-ceph-mon2,ksr-ceph-mon3,ksr-ceph-mon1,ksr-ceph-mon5,ksr-ceph-mon4 (age 3h)
    mgr: ksr-ceph-mon1(active, since 3h), standbys: ksr-ceph-mon2, ksr-ceph-mon3
    mds: 2/2 daemons up, 3 standby
    osd: 12 osds: 12 up (since 3h), 12 in (since 6d); 22 remapped pgs
         flags nobackfill,norebalance,norecover

  data:
    volumes: 2/2 healthy
    pools:   5 pools, 129 pgs
    objects: 578 objects, 154 MiB
    usage:   1.0 GiB used, 599 GiB / 600 GiB avail
    pgs:     2701/1734 objects degraded (155.767%)
             479/1734 objects misplaced (27.624%)
             70 active+clean
             34 active+recovery_wait+degraded
             20 active+recovery_wait+undersized+degraded+remapped
              3 active+recovering
              1 active+recovery_wait+remapped
              1 active+recovery_wait+degraded+remapped

No CRUSH rules have been changed on the pools. All pools use the default replicated_rule:

ceph osd crush rule dump replicated_rule
{
    "rule_id": 0,
    "rule_name": "replicated_rule",
    "type": 1,
    "steps": [
        {
            "op": "take",
            "item": -1,
            "item_name": "default"
        },
        {
            "op": "chooseleaf_firstn",
            "num": 0,
            "type": "host"
        },
        {
            "op": "emit"
        }
    ]
}

Questions:

1 - Why does this result in such a high "objects degraded" percentage?
2 - Why do PGs become undersized?

All in all this behavior does not make sense to me. Since I have only added some buckets to the map and no rules have changed, I expected essentially nothing to happen. So I am reaching out in the hope that someone can explain the logic to me.

_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx
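[Editorial note: for anyone who wants to reproduce the placement change offline, the effect of the bucket moves on the rule's mappings can be inspected without touching the cluster by feeding the CRUSH maps from before and after the move through crushtool. This is a sketch; the file names are placeholders, and it assumes the crushtool binary shipped with the ceph packages is available.]

```shell
# Save the compiled CRUSH map before and after the "ceph osd crush move"
# commands, then let crushtool compute the placements rule 0 would
# produce for each map, and diff the results.
ceph osd getcrushmap -o crush.before          # run this before the moves
# ... perform the crush moves ...
ceph osd getcrushmap -o crush.after           # run this after the moves

# Simulate replicated_rule (rule id 0) with 3 replicas for a range of inputs.
crushtool -i crush.before --test --rule 0 --num-rep 3 --show-mappings > mappings.before
crushtool -i crush.after  --test --rule 0 --num-rep 3 --show-mappings > mappings.after

# Every differing line is an input whose OSD set changed purely because
# of the new rack buckets in the hierarchy.
diff mappings.before mappings.after
```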