You seem to have a problem with your crush rule(s):

14.3d ... [18,17,16,3,1,0,NONE,NONE,12]

If you really just took out 1 OSD, having 2x NONE in the acting set indicates that your crush rule can't find valid mappings. You might need to tune the crush tunables:

https://docs.ceph.com/en/reef/rados/troubleshooting/troubleshooting-pg/?highlight=crush%20gives%20up#troubleshooting-pgs

It is possible that your low OSD count triggers the "crush gives up too soon" issue.

You might also consider using a crush rule that places exactly 3 shards per host (examples were in posts just last week). Otherwise it is not guaranteed that "... data remains available if a whole host goes down ...", because you might end up with 4 chunks on one of the hosts and fall below min_size (the failure domain of your crush rule for the EC profiles is OSD).

To test whether your crush rules can generate valid mappings, you can pull the osdmap of your cluster and use osdmaptool to experiment with it without any risk of destroying anything. It lets you try different crush rules and failure scenarios offline, but against real cluster metadata.

Best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

________________________________________
From: Hector Martin <marcan@xxxxxxxxx>
Sent: Friday, January 19, 2024 10:12 AM
To: ceph-users@xxxxxxx
Subject: Degraded PGs on EC pool when marking an OSD out

I'm having a bit of a weird issue with cluster rebalances with a new EC pool. I have a 3-machine cluster, each machine with 4 HDD OSDs (+1 SSD). Until now I've been using an erasure-coded k=5 m=3 pool for most of my data. I've recently started to migrate to a k=5 m=4 pool, so I can configure the CRUSH rule to guarantee that data remains available if a whole host goes down (3 chunks per host, 9 total). I also moved the 5,3 pool to this setup, although by nature I know its PGs will become inactive if a host goes down (it needs at least k+1 OSDs to be up).

I've only just started migrating data to the 5,4 pool, but I've noticed that any time I trigger any kind of backfilling (e.g. take one OSD out), a bunch of PGs in the 5,4 pool become degraded (instead of just misplaced/backfilling). This always seems to happen on that pool only, and the object count is a significant fraction of the total pool object count (it's not just "a few recently written objects while PGs were re-peering" or anything like that, I know about that effect).
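For reference, the degraded PGs can also be listed per pool directly with something like the following (rough sketch, state filter from memory; the full pg dump further down is what I actually looked at):

# ceph pg ls-by-pool cephfs2_data_hec5.3 degraded
# ceph pg ls-by-pool cephfs2_data_hec5.4 degraded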
Here are the pools:

pool 13 'cephfs2_data_hec5.3' erasure profile ec5.3 size 8 min_size 6 crush_rule 7 object_hash rjenkins pg_num 64 pgp_num 64 autoscale_mode warn last_change 14133 lfor 0/11307/11305 flags hashpspool,ec_overwrites,bulk stripe_width 20480 application cephfs
pool 14 'cephfs2_data_hec5.4' erasure profile ec5.4 size 9 min_size 6 crush_rule 7 object_hash rjenkins pg_num 64 pgp_num 64 autoscale_mode warn last_change 14509 lfor 0/0/14234 flags hashpspool,ec_overwrites,bulk stripe_width 20480 application cephfs

EC profiles:

# ceph osd erasure-code-profile get ec5.3
crush-device-class=
crush-failure-domain=osd
crush-root=default
jerasure-per-chunk-alignment=false
k=5
m=3
plugin=jerasure
technique=reed_sol_van
w=8

# ceph osd erasure-code-profile get ec5.4
crush-device-class=
crush-failure-domain=osd
crush-root=default
jerasure-per-chunk-alignment=false
k=5
m=4
plugin=jerasure
technique=reed_sol_van
w=8

They both use the same CRUSH rule, which is designed to select 9 OSDs balanced across the hosts (of which only 8 slots get used for the older 5,3 pool):

rule hdd-ec-x3 {
        id 7
        type erasure
        step set_chooseleaf_tries 5
        step set_choose_tries 100
        step take default class hdd
        step choose indep 3 type host
        step choose indep 3 type osd
        step emit
}

If I take out an OSD (14), I get something like this:

    health: HEALTH_WARN
            Degraded data redundancy: 37631/120155160 objects degraded (0.031%), 38 pgs degraded

All the degraded PGs are in the 5,4 pool, and the total object count is around 50k, so this is *most* of the data in the pool becoming degraded just because I marked an OSD out (without stopping it). If I mark the OSD in again, the degraded state goes away.

Example degraded PGs:

# ceph pg dump | grep degraded
dumped all
14.3c 812 0 838 0 0 11925027758 0 0 1088 0 1088 active+recovery_wait+undersized+degraded+remapped 2024-01-19T18:06:41.786745+0900 15440'1088 15486:10772 [18,17,16,1,3,2,11,13,12] 18 [18,17,16,1,3,2,11,NONE,12] 18 14537'432 2024-01-12T11:25:54.168048+0900 0'0 2024-01-08T15:18:21.654679+0900 0 2 periodic scrub scheduled @ 2024-01-21T08:00:23.572904+0900 241 0
14.3d 772 0 1602 0 0 11303280223 0 0 1283 0 1283 active+recovery_wait+undersized+degraded+remapped 2024-01-19T18:06:41.919971+0900 15470'1283 15486:13384 [18,17,16,3,1,0,13,11,12] 18 [18,17,16,3,1,0,NONE,NONE,12] 18 14990'771 2024-01-15T12:15:59.397469+0900 0'0 2024-01-08T15:18:21.654679+0900 0 3 periodic scrub scheduled @ 2024-01-23T15:56:58.912801+0900 534 0
14.3e 806 0 832 0 0 11843019697 0 0 1035 0 1035 active+recovery_wait+undersized+degraded+remapped 2024-01-19T18:06:42.297251+0900 15465'1035 15486:15423 [18,16,17,12,13,11,1,3,0] 18 [18,16,17,12,13,NONE,1,3,0] 18 14623'500 2024-01-13T08:54:55.709717+0900 0'0 2024-01-08T15:18:21.654679+0900 0 1 periodic scrub scheduled @ 2024-01-22T09:54:51.278368+0900 331 0
14.3f 782 0 813 0 0 11598393034 0 0 1083 0 1083 active+recovery_wait+undersized+degraded+remapped 2024-01-19T18:06:41.845173+0900 15465'1083 15486:18496 [17,18,16,3,0,1,11,12,13] 17 [17,18,16,3,0,1,11,NONE,13] 17 14990'800 2024-01-15T16:42:08.037844+0900 14990'800 2024-01-15T16:42:08.037844+0900 0 40 periodic scrub scheduled @ 2024-01-23T10:44:06.083985+0900 563 0

The first PG when I put the OSD back in:

14.3c 812 0 0 0 0 11925027758 0 0 1088 0 1088 active+clean 2024-01-19T18:07:18.079295+0900 15440'1088 15489:10792 [18,17,16,1,3,2,11,14,12] 18 [18,17,16,1,3,2,11,14,12] 18 14537'432 2024-01-12T11:25:54.168048+0900 0'0 2024-01-08T15:18:21.654679+0900 0 2 periodic scrub scheduled @ 2024-01-21T09:41:43.026836+0900 241 0
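For completeness, the same gap is visible when querying one of the degraded PGs directly; roughly like this (the jq filter is just a sketch of the relevant fields, and if I remember right the missing shard shows up as 2147483647 in the JSON):

# ceph pg 14.3d query | jq '{state, up, acting, recovery: [.recovery_state[].name]}'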
As far as I know PGs are not supposed to actually become *degraded* when merely moving data around without any OSDs going down. Am I doing something wrong here? Any idea why this is affecting one pool and not both, even though they are almost identical in setup? It's as if, for this one pool, marking an OSD out has the effect of making its data unavailable entirely, instead of merely backfilling to other OSDs (the OSD shows up as NONE in the above dump).

OSD tree:

ID   CLASS  WEIGHT    TYPE NAME          STATUS  REWEIGHT  PRI-AFF
 -1         89.13765  root default
-13         29.76414      host flamingo
 11    hdd   7.27739          osd.11         up   1.00000  1.00000
 12    hdd   7.27739          osd.12         up   1.00000  1.00000
 13    hdd   7.27739          osd.13         up   1.00000  1.00000
 14    hdd   7.20000          osd.14         up   1.00000  1.00000
  8    ssd   0.73198          osd.8          up   1.00000  1.00000
-10         29.84154      host heart
  0    hdd   7.27739          osd.0          up   1.00000  1.00000
  1    hdd   7.27739          osd.1          up   1.00000  1.00000
  2    hdd   7.27739          osd.2          up   1.00000  1.00000
  3    hdd   7.27739          osd.3          up   1.00000  1.00000
  9    ssd   0.73198          osd.9          up   1.00000  1.00000
 -3          0             host hub
 -7         29.53197      host soleil
 15    hdd   7.20000          osd.15         up         0  1.00000
 16    hdd   7.20000          osd.16         up   1.00000  1.00000
 17    hdd   7.20000          osd.17         up   1.00000  1.00000
 18    hdd   7.20000          osd.18         up   1.00000  1.00000
 10    ssd   0.73198          osd.10         up   1.00000  1.00000

(I'm in the middle of some reprovisioning, so osd.15 is out; this happens any time I take any OSD out.)

# ceph --version
ceph version 18.2.1 (7fe91d5d5842e04be3b4f514d6dd990c54b29c76) reef (stable)

- Hector

_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx