degraded objects when setting different CRUSH rule on a pool, why?

Hi,

TL;DR:

Selecting a different CRUSH rule (stretch_rule, no device class) for pool SSD results in degraded objects (unexpected) and misplaced objects (expected). Why would Ceph drop up to two healthy copies?

Consider this two data center cluster:

ID   CLASS  WEIGHT   TYPE NAME           STATUS  REWEIGHT  PRI-AFF
 -1         0.78400  root default
-10         0.39200      datacenter DC1
 -3         0.39200          host pve1
  0    hdd  0.09799              osd.0       up   1.00000  1.00000
  1    hdd  0.09799              osd.1       up   1.00000  1.00000
  4    ssd  0.09799              osd.4       up   1.00000  1.00000
  5    ssd  0.09799              osd.5       up   1.00000  1.00000
-11         0.39200      datacenter DC2
 -5         0.39200          host pve2
  2    hdd  0.09799              osd.2       up   1.00000  1.00000
  3    hdd  0.09799              osd.3       up   1.00000  1.00000
  6    ssd  0.09799              osd.6       up   1.00000  1.00000
  7    ssd  0.09799              osd.7       up   1.00000  1.00000

Pools available:

device_health_metrics
HDD
SSD

Let's focus on SSD for now. CRUSH rule in use for the SSD pool:

rule SSD {
    id 2
    type replicated
    min_size 1
    max_size 10
    step take default class ssd
    step choose firstn 0 type host
    step chooseleaf firstn 0 type osd
    step emit
}
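
The rules quoted in this mail come straight from the decompiled CRUSH map; a minimal sketch of how to pull that out, assuming the stock ceph and crushtool binaries (file names are just examples):

ceph osd getcrushmap -o crushmap.bin      # dump the binary CRUSH map
crushtool -d crushmap.bin -o crushmap.txt # decompile it to readable text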

SSD pool replication settings: min_size=2, size=4
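
For completeness, those values were read back with the plain ceph CLI, roughly like this:

ceph osd pool get SSD size        # -> size: 4
ceph osd pool get SSD min_size    # -> min_size: 2
ceph osd pool get SSD crush_rule  # -> crush_rule: SSD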

The new stretch rule to use:

rule stretch_rule {
    id 3
    type replicated
    min_size 1
    max_size 10
    step take default
    step take DC1
    step choose firstn 0 type host
    step chooseleaf firstn 2 type osd
    step emit
    step take default
    step take DC2
    step choose firstn 0 type host
    step chooseleaf firstn 2 type osd
    step emit
}
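
For what it's worth, the stretch rule can be dry-run against the compiled map with crushtool before it is assigned to the pool (rule id 3 and 4 replicas to match the pool size; the map file name is arbitrary):

ceph osd getcrushmap -o crushmap.bin
crushtool -i crushmap.bin --test --rule 3 --num-rep 4 --show-mappings | head

That makes it easy to check the intended 2+2 split across DC1 and DC2 without touching the pool.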

ceph pg ls for pool 3 (SSD); the columns are PG, OBJECTS, DEGRADED, MISPLACED, UNFOUND, BYTES, OMAP_BYTES*, OMAP_KEYS*, LOG, STATE, SINCE, VERSION, REPORTED, UP, ACTING and the scrub stamps:

3.0      184         0          0        0  738070528            0           0   7458  active+clean     9m   926'7458  951:18695  [7,6,5,4]p7  [7,6,5,4]p7  2024-06-05T08:52:51.951936+0200  2024-06-05T08:52:51.951936+0200
3.1      169         0          0        0  667156655            0           0   3242  active+clean    18m   926'3242  951:14151  [4,5,6,7]p4  [4,5,6,7]p4  2024-06-05T08:54:42.948682+0200  2024-06-05T08:54:42.948682+0200
3.2      221         0          0        0  885641968            0           0   6989  active+clean     9m   926'6989  951:17645  [4,5,7,6]p4  [4,5,7,6]p4  2024-06-05T08:53:27.981787+0200  2024-06-05T08:53:27.981787+0200
3.3      180         0          0        0  716509184            0           0   3194  active+clean     9m   926'3194  951:14191  [4,5,6,7]p4  [4,5,6,7]p4  2024-06-05T08:54:29.584216+0200  2024-06-05T08:54:29.584216+0200
3.4      189         0          0        0  754417698          137           8   3616  active+clean     9m   926'3616  951:19245  [6,7,4,5]p6  [6,7,4,5]p6  2024-06-05T08:54:02.307323+0200  2024-06-05T08:54:02.307323+0200
3.5      188         0          0        0  742543377            0           0   5992  active+clean     9m   926'5992  951:18862  [6,7,5,4]p6  [6,7,5,4]p6  2024-06-05T08:53:09.483136+0200  2024-06-05T08:53:09.483136+0200
3.6      191         0          0        0  769482752          150          16   6810  active+clean     9m   926'6810  951:30043  [7,6,4,5]p7  [7,6,4,5]p7  2024-06-05T08:53:46.646517+0200  2024-06-05T08:53:46.646517+0200
3.7      170         0          0        0  681587379            0           0  10081  active+clean     9m  926'21581  951:28473  [4,5,7,6]p4  [4,5,7,6]p4  2024-06-05T08:54:16.047967+0200  2024-06-05T08:54:16.047967+0200
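
The rule was then selected for the pool in the usual way, i.e. something like:

ceph osd pool set SSD crush_rule stretch_rule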


The same ceph pg ls (same columns) right after the new CRUSH rule is selected for pool SSD:

3.0      184       372        186        0  738070528            0           0   7458  active+recovery_wait+undersized+degraded+remapped    14s   926'7458  955:18688  [1,0,3,7]p1      [4,7]p4  2024-06-05T08:52:51.951936+0200  2024-06-05T08:52:51.951936+0200
3.1      169         0          0        0  667156655            0           0   3242                                       active+clean    19m   926'3242  954:14154  [4,5,6,7]p4  [4,5,6,7]p4  2024-06-05T08:54:42.948682+0200  2024-06-05T08:54:42.948682+0200
3.2      221       444          0        0  885641968            0           0   6989  active+recovery_wait+undersized+degraded+remapped    14s   926'6989  955:17657  [4,5,3,2]p4      [4,5]p4  2024-06-05T08:53:27.981787+0200  2024-06-05T08:53:27.981787+0200
3.3      180         0        540        0  716509184            0           0   3194     active+recovering+undersized+degraded+remapped    15s   926'3194  955:14204  [4,0,2,3]p4      [4,0]p4  2024-06-05T08:54:29.584216+0200  2024-06-05T08:54:29.584216+0200
3.4      189       378          0        0  754417698            0           0   3616  active+recovery_wait+undersized+degraded+remapped    15s   926'3616  955:19220  [1,4,6,2]p1      [4,6]p4  2024-06-05T08:54:02.307323+0200  2024-06-05T08:54:02.307323+0200
3.5      188       189          0        0  742543377            0           0   5992  active+recovery_wait+undersized+degraded+remapped    15s   926'5992  955:18845  [5,4,6,3]p5    [5,4,6]p5  2024-06-05T08:53:09.483136+0200  2024-06-05T08:53:09.483136+0200
3.6      191       390        195        0  769482752            0           0   6810  active+recovery_wait+undersized+degraded+remapped    14s   926'6810  955:30016  [0,1,2,7]p0      [4,7]p4  2024-06-05T08:53:46.646517+0200  2024-06-05T08:53:46.646517+0200
3.7      170         0        170        0  681587379            0           0  10081                      active+remapped+backfill_wait    14s  926'21581  955:28486  [4,5,3,6]p4  [4,5,6,7]p4  2024-06-05T08:54:16.047967+0200  2024-06-05T08:54:16.047967+0200


So CRUSH is able to find a suitable mapping just fine, but somehow Ceph decides to drop up to two healthy copies from its acting set, and I do not understand why. I would expect only misplaced objects at this point.

Ceph version is 15.2.17. Latest CRUSH tunables (ceph osd crush tunables optimal).
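
In case it helps with diagnosing, the per-PG detail above can be inspected further with the usual commands, e.g.:

ceph pg ls-by-pool SSD   # same listing as above, by pool name
ceph pg 3.0 query        # recovery_state and past intervals of one degraded PG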

Am I missing something obvious here? If so, would you please point it out to me? :D

Gr. Stefan


