Hi Stefan,
I assume the number of dropped replicas is related to the pool's
min_size. If you increase min_size to 3 you should see only one
replica dropped from the acting set. I haven't run very detailed
tests, but a first quick one seems to confirm this:
# Test with min_size 2, size 4
48.7 [6,2,11,10]p6 [6,2]p6
I changed the rule back to the previous one, then switched again with
the new min_size:
# Test with min_size 3, size 4
48.7 [6,2,11,10]p6 [6,2,10]p6
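In your setup the equivalent steps would be roughly something like
this (pool and rule names taken from your mail below):
  ceph osd pool set SSD min_size 3
  ceph osd pool set SSD crush_rule stretch_rule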
I don't really have an explanation for why they show up as degraded,
though. I think Frank already had some open bug reports for that; this
topic comes up every now and then without a satisfying explanation.
Regards,
Eugen
Quoting Stefan Kooman <stefan@xxxxxx>:
Hi,
TL;DR:
Selecting a different CRUSH rule (stretch_rule, no device class) for
pool SSD results in degraded objects (unexpected) and misplaced
objects (expected). Why would Ceph drop up to two healthy copies?
Consider this two-data-center cluster:
ID CLASS WEIGHT TYPE NAME STATUS REWEIGHT PRI-AFF
-1 0.78400 root default
-10 0.39200 datacenter DC1
-3 0.39200 host pve1
0 hdd 0.09799 osd.0 up 1.00000 1.00000
1 hdd 0.09799 osd.1 up 1.00000 1.00000
4 ssd 0.09799 osd.4 up 1.00000 1.00000
5 ssd 0.09799 osd.5 up 1.00000 1.00000
-11 0.39200 datacenter DC2
-5 0.39200 host pve2
2 hdd 0.09799 osd.2 up 1.00000 1.00000
3 hdd 0.09799 osd.3 up 1.00000 1.00000
6 ssd 0.09799 osd.6 up 1.00000 1.00000
7 ssd 0.09799 osd.7 up 1.00000 1.00000
Pools available:
device_health_metrics
HDD
SSD
Let's focus on SSD for now. CRUSH rule in use for the SSD pool:
rule SSD {
    id 2
    type replicated
    min_size 1
    max_size 10
    step take default class ssd
    step choose firstn 0 type host
    step chooseleaf firstn 0 type osd
    step emit
}
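(Side note: the mapping a rule produces can be checked offline before
switching a pool over, by extracting the CRUSH map and running
crushtool against it; the file names here are just placeholders:
  ceph osd getcrushmap -o crushmap.bin
  crushtool -i crushmap.bin --test --rule 2 --num-rep 4 --show-mappings
)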
SSD pool replication settings: min_size=2, size=4
The new stretch rule to use:
rule stretch_rule {
    id 3
    type replicated
    min_size 1
    max_size 10
    step take default
    step take DC1
    step choose firstn 0 type host
    step chooseleaf firstn 2 type osd
    step emit
    step take default
    step take DC2
    step choose firstn 0 type host
    step chooseleaf firstn 2 type osd
    step emit
}
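(For completeness: a rule like this is typically added by decompiling
the CRUSH map, editing it and injecting it back; a rough sketch with
placeholder file names:
  ceph osd getcrushmap -o cm.bin
  crushtool -d cm.bin -o cm.txt
  # add the stretch_rule shown above to cm.txt
  crushtool -c cm.txt -o cm-new.bin
  ceph osd setcrushmap -i cm-new.bin
)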
ceph pg ls for pool 3 (SSD):
3.0 184 0 0 0 738070528 0 0 7458 active+clean 9m 926'7458 951:18695 [7,6,5,4]p7 [7,6,5,4]p7 2024-06-05T08:52:51.951936+0200 2024-06-05T08:52:51.951936+0200
3.1 169 0 0 0 667156655 0 0 3242 active+clean 18m 926'3242 951:14151 [4,5,6,7]p4 [4,5,6,7]p4 2024-06-05T08:54:42.948682+0200 2024-06-05T08:54:42.948682+0200
3.2 221 0 0 0 885641968 0 0 6989 active+clean 9m 926'6989 951:17645 [4,5,7,6]p4 [4,5,7,6]p4 2024-06-05T08:53:27.981787+0200 2024-06-05T08:53:27.981787+0200
3.3 180 0 0 0 716509184 0 0 3194 active+clean 9m 926'3194 951:14191 [4,5,6,7]p4 [4,5,6,7]p4 2024-06-05T08:54:29.584216+0200 2024-06-05T08:54:29.584216+0200
3.4 189 0 0 0 754417698 137 8 3616 active+clean 9m 926'3616 951:19245 [6,7,4,5]p6 [6,7,4,5]p6 2024-06-05T08:54:02.307323+0200 2024-06-05T08:54:02.307323+0200
3.5 188 0 0 0 742543377 0 0 5992 active+clean 9m 926'5992 951:18862 [6,7,5,4]p6 [6,7,5,4]p6 2024-06-05T08:53:09.483136+0200 2024-06-05T08:53:09.483136+0200
3.6 191 0 0 0 769482752 150 16 6810 active+clean 9m 926'6810 951:30043 [7,6,4,5]p7 [7,6,4,5]p7 2024-06-05T08:53:46.646517+0200 2024-06-05T08:53:46.646517+0200
3.7 170 0 0 0 681587379 0 0 10081 active+clean 9m 926'21581 951:28473 [4,5,7,6]p4 [4,5,7,6]p4 2024-06-05T08:54:16.047967+0200 2024-06-05T08:54:16.047967+0200
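(The switch itself is just pointing the pool at the new rule,
something along the lines of:
  ceph osd pool set SSD crush_rule stretch_rule
)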
ceph pg ls when the new crush rule is selected for pool SSD:
3.0 184 372 186 0 738070528 0 0 7458 active+recovery_wait+undersized+degraded+remapped 14s 926'7458 955:18688 [1,0,3,7]p1 [4,7]p4 2024-06-05T08:52:51.951936+0200 2024-06-05T08:52:51.951936+0200
3.1 169 0 0 0 667156655 0 0 3242 active+clean 19m 926'3242 954:14154 [4,5,6,7]p4 [4,5,6,7]p4 2024-06-05T08:54:42.948682+0200 2024-06-05T08:54:42.948682+0200
3.2 221 444 0 0 885641968 0 0 6989 active+recovery_wait+undersized+degraded+remapped 14s 926'6989 955:17657 [4,5,3,2]p4 [4,5]p4 2024-06-05T08:53:27.981787+0200 2024-06-05T08:53:27.981787+0200
3.3 180 0 540 0 716509184 0 0 3194 active+recovering+undersized+degraded+remapped 15s 926'3194 955:14204 [4,0,2,3]p4 [4,0]p4 2024-06-05T08:54:29.584216+0200 2024-06-05T08:54:29.584216+0200
3.4 189 378 0 0 754417698 0 0 3616 active+recovery_wait+undersized+degraded+remapped 15s 926'3616 955:19220 [1,4,6,2]p1 [4,6]p4 2024-06-05T08:54:02.307323+0200 2024-06-05T08:54:02.307323+0200
3.5 188 189 0 0 742543377 0 0 5992 active+recovery_wait+undersized+degraded+remapped 15s 926'5992 955:18845 [5,4,6,3]p5 [5,4,6]p5 2024-06-05T08:53:09.483136+0200 2024-06-05T08:53:09.483136+0200
3.6 191 390 195 0 769482752 0 0 6810 active+recovery_wait+undersized+degraded+remapped 14s 926'6810 955:30016 [0,1,2,7]p0 [4,7]p4 2024-06-05T08:53:46.646517+0200 2024-06-05T08:53:46.646517+0200
3.7 170 0 170 0 681587379 0 0 10081 active+remapped+backfill_wait 14s 926'21581 955:28486 [4,5,3,6]p4 [4,5,6,7]p4 2024-06-05T08:54:16.047967+0200 2024-06-05T08:54:16.047967+0200
So CRUSH is able to find a suitable mapping just fine, but somehow
Ceph decides to drop up to two healthy copies from the acting set,
and I do not understand why. I would expect only misplaced objects
at this point.
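(For anyone who wants to dig in: the peering details behind an acting
set can be inspected per PG, e.g.
  ceph pg 3.0 query
and then looking at the up/acting sets and the recovery_state section
in the output.)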
Ceph version is 15.2.17. Latest CRUSH tunables (ceph osd crush
tunables optimal).
Am I missing something obvious here? If so, would you please point it
out to me? :D
Gr. Stefan
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx