Hi Stefan,
I assume the number of dropped replicas is related to the pool's
min_size. If you increase min_size to 3 you should see only one
replica dropped from the acting set. I haven't run very detailed
tests, but a first quick one seems to confirm this:
# Test with min_size 2, size 4
48.7 [6,2,11,10]p6 [6,2]p6
I changed the rule back to the previous one, then switched again with
the new min_size:
# Test with min_size 3, size 4
48.7 [6,2,11,10]p6 [6,2,10]p6
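In your setup the equivalent steps would be roughly something like
this (pool and rule names taken from your mail below):
  ceph osd pool set SSD min_size 3
  ceph osd pool set SSD crush_rule stretch_rule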
I don't really have an explanation for why they show up as degraded,
though. I think Frank already had some open bug reports for that; this
topic comes up every now and then without a satisfying explanation.
Regards,
Eugen
Quoting Stefan Kooman <stefan@xxxxxx>:
Hi,
TL;DR:
Selecting a different CRUSH rule (stretch_rule, no device class) for
pool SSD results in degraded objects (unexpected) and misplaced
objects (expected). Why would Ceph drop up to two healthy copies?
Consider this two-data-center cluster:
ID CLASS WEIGHT TYPE NAME STATUS REWEIGHT PRI-AFF
-1 0.78400 root default
-10 0.39200 datacenter DC1
-3 0.39200 host pve1
0 hdd 0.09799 osd.0 up 1.00000 1.00000
1 hdd 0.09799 osd.1 up 1.00000 1.00000
4 ssd 0.09799 osd.4 up 1.00000 1.00000
5 ssd 0.09799 osd.5 up 1.00000 1.00000
-11 0.39200 datacenter DC2
-5 0.39200 host pve2
2 hdd 0.09799 osd.2 up 1.00000 1.00000
3 hdd 0.09799 osd.3 up 1.00000 1.00000
6 ssd 0.09799 osd.6 up 1.00000 1.00000
7 ssd 0.09799 osd.7 up 1.00000 1.00000
Pools available:
device_health_metrics
HDD
SSD
Let's focus on SSD for now. CRUSH rule in use for the SSD pool:
rule SSD {
    id 2
    type replicated
    min_size 1
    max_size 10
    step take default class ssd
    step choose firstn 0 type host
    step chooseleaf firstn 0 type osd
    step emit
}
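(Side note: the mapping a rule produces can be checked offline before
switching a pool over, by extracting the CRUSH map and running
crushtool against it; the file names here are just placeholders:
  ceph osd getcrushmap -o crushmap.bin
  crushtool -i crushmap.bin --test --rule 2 --num-rep 4 --show-mappings
)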
SSD pool replication settings: min_size=2, size=4
The new stretch rule to use:
rule stretch_rule {
    id 3
    type replicated
    min_size 1
    max_size 10
    step take default
    step take DC1
    step choose firstn 0 type host
    step chooseleaf firstn 2 type osd
    step emit
    step take default
    step take DC2
    step choose firstn 0 type host
    step chooseleaf firstn 2 type osd
    step emit
}
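(For completeness: a rule like this is typically added by decompiling
the CRUSH map, editing it and injecting it back; a rough sketch with
placeholder file names:
  ceph osd getcrushmap -o cm.bin
  crushtool -d cm.bin -o cm.txt
  # add the stretch_rule shown above to cm.txt
  crushtool -c cm.txt -o cm-new.bin
  ceph osd setcrushmap -i cm-new.bin
)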
ceph pg ls for pool 3 (SSD):
3.0 184 0 0 0 738070528 0 0 7458 active+clean 9m 926'7458 951:18695 [7,6,5,4]p7 [7,6,5,4]p7 2024-06-05T08:52:51.951936+0200 2024-06-05T08:52:51.951936+0200
3.1 169 0 0 0 667156655 0 0 3242 active+clean 18m 926'3242 951:14151 [4,5,6,7]p4 [4,5,6,7]p4 2024-06-05T08:54:42.948682+0200 2024-06-05T08:54:42.948682+0200
3.2 221 0 0 0 885641968 0 0 6989 active+clean 9m 926'6989 951:17645 [4,5,7,6]p4 [4,5,7,6]p4 2024-06-05T08:53:27.981787+0200 2024-06-05T08:53:27.981787+0200
3.3 180 0 0 0 716509184 0 0 3194 active+clean 9m 926'3194 951:14191 [4,5,6,7]p4 [4,5,6,7]p4 2024-06-05T08:54:29.584216+0200 2024-06-05T08:54:29.584216+0200
3.4 189 0 0 0 754417698 137 8 3616 active+clean 9m 926'3616 951:19245 [6,7,4,5]p6 [6,7,4,5]p6 2024-06-05T08:54:02.307323+0200 2024-06-05T08:54:02.307323+0200
3.5 188 0 0 0 742543377 0 0 5992 active+clean 9m 926'5992 951:18862 [6,7,5,4]p6 [6,7,5,4]p6 2024-06-05T08:53:09.483136+0200 2024-06-05T08:53:09.483136+0200
3.6 191 0 0 0 769482752 150 16 6810 active+clean 9m 926'6810 951:30043 [7,6,4,5]p7 [7,6,4,5]p7 2024-06-05T08:53:46.646517+0200 2024-06-05T08:53:46.646517+0200
3.7 170 0 0 0 681587379 0 0 10081 active+clean 9m 926'21581 951:28473 [4,5,7,6]p4 [4,5,7,6]p4 2024-06-05T08:54:16.047967+0200 2024-06-05T08:54:16.047967+0200
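(The switch itself is just pointing the pool at the new rule,
something along the lines of:
  ceph osd pool set SSD crush_rule stretch_rule
)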
ceph pg ls when the new crush rule is selected for pool SSD:
3.0 184 372 186 0 738070528 0 0 7458 active+recovery_wait+undersized+degraded+remapped 14s 926'7458 955:18688 [1,0,3,7]p1 [4,7]p4 2024-06-05T08:52:51.951936+0200 2024-06-05T08:52:51.951936+0200
3.1 169 0 0 0 667156655 0 0 3242 active+clean 19m 926'3242 954:14154 [4,5,6,7]p4 [4,5,6,7]p4 2024-06-05T08:54:42.948682+0200 2024-06-05T08:54:42.948682+0200
3.2 221 444 0 0 885641968 0 0 6989 active+recovery_wait+undersized+degraded+remapped 14s 926'6989 955:17657 [4,5,3,2]p4 [4,5]p4 2024-06-05T08:53:27.981787+0200 2024-06-05T08:53:27.981787+0200
3.3 180 0 540 0 716509184 0 0 3194 active+recovering+undersized+degraded+remapped 15s 926'3194 955:14204 [4,0,2,3]p4 [4,0]p4 2024-06-05T08:54:29.584216+0200 2024-06-05T08:54:29.584216+0200
3.4 189 378 0 0 754417698 0 0 3616 active+recovery_wait+undersized+degraded+remapped 15s 926'3616 955:19220 [1,4,6,2]p1 [4,6]p4 2024-06-05T08:54:02.307323+0200 2024-06-05T08:54:02.307323+0200
3.5 188 189 0 0 742543377 0 0 5992 active+recovery_wait+undersized+degraded+remapped 15s 926'5992 955:18845 [5,4,6,3]p5 [5,4,6]p5 2024-06-05T08:53:09.483136+0200 2024-06-05T08:53:09.483136+0200
3.6 191 390 195 0 769482752 0 0 6810 active+recovery_wait+undersized+degraded+remapped 14s 926'6810 955:30016 [0,1,2,7]p0 [4,7]p4 2024-06-05T08:53:46.646517+0200 2024-06-05T08:53:46.646517+0200
3.7 170 0 170 0 681587379 0 0 10081 active+remapped+backfill_wait 14s 926'21581 955:28486 [4,5,3,6]p4 [4,5,6,7]p4 2024-06-05T08:54:16.047967+0200 2024-06-05T08:54:16.047967+0200
So CRUSH is able to find a suitable mapping just fine, but somehow
Ceph decides to drop up to two healthy copies from the acting set,
and I do not understand why. I would expect only misplaced objects
at this point.
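(For anyone who wants to dig in: the peering details behind an acting
set can be inspected per PG, e.g.
  ceph pg 3.0 query
and then looking at the up/acting sets and the recovery_state section
in the output.)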
Ceph version is 15.2.17. Latest CRUSH tunables (ceph osd crush
tunables optimal).
Am I missing something obvious here? If so, would you please point it
out to me? :D
Gr. Stefan
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx