On 2024/01/22 19:06, Frank Schilder wrote:
> You seem to have a problem with your crush rule(s):
>
> 14.3d ... [18,17,16,3,1,0,NONE,NONE,12]
>
> If you really just took out 1 OSD, having 2xNONE in the acting set indicates that your crush rule can't find valid mappings. You might need to tune crush tunables: https://docs.ceph.com/en/reef/rados/troubleshooting/troubleshooting-pg/?highlight=crush%20gives%20up#troubleshooting-pgs

Look closely: that's the *acting* (second column) OSD set, not the *up* (first column) OSD set. It's supposed to be the *previous* set of OSDs assigned to that PG, but inexplicably some OSDs just "fall off" when the PGs get remapped around. Simply waiting lets the data recover.

At no point are any of my PGs actually missing OSDs according to the current cluster state, and CRUSH always finds a valid mapping. Rather, the problem is that the *previous* set of OSDs just loses some entries for some reason.

The same problem happens when I *add* an OSD to the cluster. For example, right now, osd.15 is out. This is the state of one PG:

14.3d 1044 0 0 0 0 15730756731 0 0 1630 0 1630 active+clean 2024-01-22T20:15:46.684066+0900 15550'1630 15550:16184 [18,17,16,3,1,0,11,14,12] 18 [18,17,16,3,1,0,11,14,12] 18 15550'1629 2024-01-22T20:15:46.683491+0900 0'0 2024-01-08T15:18:21.654679+0900 0 2 periodic scrub scheduled @ 2024-01-31T07:34:27.297723+0900 1043 0

Note the OSD list ([18,17,16,3,1,0,11,14,12]).

Then I bring osd.15 in and:

14.3d 1044 0 1077 0 0 15730756731 0 0 1630 0 1630 active+recovery_wait+undersized+degraded+remapped 2024-01-22T22:52:22.700096+0900 15550'1630 15554:16163 [15,17,16,3,1,0,11,14,12] 15 [NONE,17,16,3,1,0,11,14,12] 17 15550'1629 2024-01-22T20:15:46.683491+0900 0'0 2024-01-08T15:18:21.654679+0900 0 2 periodic scrub scheduled @ 2024-01-31T02:31:53.342289+0900 1043 0

So somehow osd.18 "vanished" from the acting list ([NONE,17,16,3,1,0,11,14,12]) as it is being replaced by osd.15 in the new up list ([15,17,16,3,1,0,11,14,12]). The data is on osd.18, but somehow Ceph forgot.

>
> It is possible that your low OSD count causes the "crush gives up too soon" issue. You might also consider to use a crush rule that places exactly 3 shards per host (examples were in posts just last week). Otherwise, it is not guaranteed that "... data remains available if a whole host goes down ..." because you might have 4 chunks on one of the hosts and fall below min_size (the failure domain of your crush rule for the EC profiles is OSD).

That should be what my CRUSH rule does: it picks 3 hosts, then picks 3 OSDs per host (IIUC). And oddly enough, everything works for the other EC pool even though it shares the same CRUSH rule (just ignoring one OSD from it).

> To test if your crush rules can generate valid mappings, you can pull the osdmap of your cluster and use osdmaptool to experiment with it without risk of destroying anything. It allows you to try different crush rules and failure scenarios on off-line but real cluster meta-data.

CRUSH steady state isn't the issue here, it's the dynamic state when moving data that is the problem :) (For reference, I've pasted the commands I'd use for that kind of off-line test a bit further down.)

>
> Best regards,
> =================
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
>
> ________________________________________
> From: Hector Martin <marcan@xxxxxxxxx>
> Sent: Friday, January 19, 2024 10:12 AM
> To: ceph-users@xxxxxxx
> Subject: Degraded PGs on EC pool when marking an OSD out
>
> I'm having a bit of a weird issue with cluster rebalances with a new EC
> pool. I have a 3-machine cluster, each machine with 4 HDD OSDs (+1 SSD).
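
(Here are the commands I'd use for the off-line mapping test mentioned above, in case anyone wants to reproduce this without touching a live cluster. The tools are the stock ceph/osdmaptool/crushtool ones, but I'm writing this from memory and the paths are just examples, so treat it as a sketch; pool 14 is the 5,4 pool and rule 7 is hdd-ec-x3:)

# ceph osd getmap -o /tmp/osdmap
# osdmaptool /tmp/osdmap --test-map-pgs-dump --pool 14
# osdmaptool /tmp/osdmap --export-crush /tmp/crushmap
# crushtool -i /tmp/crushmap --test --rule 7 --num-rep 9 --show-mappings --show-bad-mappings

The first two commands dump the PG-to-OSD mappings for pool 14 as CRUSH currently computes them; the last two extract the crush map and exercise rule 7 with 9 shards, flagging any mappings that come back incomplete.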
> Until now I've been using an erasure coded k=5 m=3 pool for most of my
> data. I've recently started to migrate to a k=5 m=4 pool, so I can
> configure the CRUSH rule to guarantee that data remains available if a
> whole host goes down (3 chunks per host, 9 total). I also moved the 5,3
> pool to this setup, although by nature I know its PGs will become
> inactive if a host goes down (need at least k+1 OSDs to be up).
>
> I've only just started migrating data to the 5,4 pool, but I've noticed
> that any time I trigger any kind of backfilling (e.g. take one OSD out),
> a bunch of PGs in the 5,4 pool become degraded (instead of just
> misplaced/backfilling). This always seems to happen on that pool only,
> and the object count is a significant fraction of the total pool object
> count (it's not just "a few recently written objects while PGs were
> repeering" or anything like that, I know about that effect).
>
> Here are the pools:
>
> pool 13 'cephfs2_data_hec5.3' erasure profile ec5.3 size 8 min_size 6
> crush_rule 7 object_hash rjenkins pg_num 64 pgp_num 64 autoscale_mode
> warn last_change 14133 lfor 0/11307/11305 flags
> hashpspool,ec_overwrites,bulk stripe_width 20480 application cephfs
> pool 14 'cephfs2_data_hec5.4' erasure profile ec5.4 size 9 min_size 6
> crush_rule 7 object_hash rjenkins pg_num 64 pgp_num 64 autoscale_mode
> warn last_change 14509 lfor 0/0/14234 flags
> hashpspool,ec_overwrites,bulk stripe_width 20480 application cephfs
>
> EC profiles:
>
> # ceph osd erasure-code-profile get ec5.3
> crush-device-class=
> crush-failure-domain=osd
> crush-root=default
> jerasure-per-chunk-alignment=false
> k=5
> m=3
> plugin=jerasure
> technique=reed_sol_van
> w=8
>
> # ceph osd erasure-code-profile get ec5.4
> crush-device-class=
> crush-failure-domain=osd
> crush-root=default
> jerasure-per-chunk-alignment=false
> k=5
> m=4
> plugin=jerasure
> technique=reed_sol_van
> w=8
>
> They both use the same CRUSH rule, which is designed to select 9 OSDs
> balanced across the hosts (of which only 8 slots get used for the older
> 5,3 pool):
>
> rule hdd-ec-x3 {
>     id 7
>     type erasure
>     step set_chooseleaf_tries 5
>     step set_choose_tries 100
>     step take default class hdd
>     step choose indep 3 type host
>     step choose indep 3 type osd
>     step emit
> }
>
> If I take out an OSD (14), I get something like this:
>
>   health: HEALTH_WARN
>     Degraded data redundancy: 37631/120155160 objects degraded
>     (0.031%), 38 pgs degraded
>
> All the degraded PGs are in the 5,4 pool, and the total object count is
> around 50k, so this is *most* of the data in the pool becoming degraded
> just because I marked an OSD out (without stopping it). If I mark the
> OSD in again, the degraded state goes away.
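
(A quick way to double-check which pool the degraded PGs belong to, for anyone reproducing this: something along these lines should do it — written from memory, so treat it as a sketch. It groups the degraded PGs from ceph pg dump by the pool id prefix of the PG id:)

# ceph pg dump pgs_brief 2>/dev/null | awk '$2 ~ /degraded/ { split($1, a, "."); n[a[1]]++ } END { for (p in n) print "pool " p ": " n[p] " degraded PGs" }'

In my case only pool 14 (the 5,4 pool) ever shows up.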
>
> Example degraded PGs:
>
> # ceph pg dump | grep degraded
> dumped all
> 14.3c 812 0 838 0 0 11925027758 0 0 1088 0 1088 active+recovery_wait+undersized+degraded+remapped 2024-01-19T18:06:41.786745+0900 15440'1088 15486:10772 [18,17,16,1,3,2,11,13,12] 18 [18,17,16,1,3,2,11,NONE,12] 18 14537'432 2024-01-12T11:25:54.168048+0900 0'0 2024-01-08T15:18:21.654679+0900 0 2 periodic scrub scheduled @ 2024-01-21T08:00:23.572904+0900 241 0
> 14.3d 772 0 1602 0 0 11303280223 0 0 1283 0 1283 active+recovery_wait+undersized+degraded+remapped 2024-01-19T18:06:41.919971+0900 15470'1283 15486:13384 [18,17,16,3,1,0,13,11,12] 18 [18,17,16,3,1,0,NONE,NONE,12] 18 14990'771 2024-01-15T12:15:59.397469+0900 0'0 2024-01-08T15:18:21.654679+0900 0 3 periodic scrub scheduled @ 2024-01-23T15:56:58.912801+0900 534 0
> 14.3e 806 0 832 0 0 11843019697 0 0 1035 0 1035 active+recovery_wait+undersized+degraded+remapped 2024-01-19T18:06:42.297251+0900 15465'1035 15486:15423 [18,16,17,12,13,11,1,3,0] 18 [18,16,17,12,13,NONE,1,3,0] 18 14623'500 2024-01-13T08:54:55.709717+0900 0'0 2024-01-08T15:18:21.654679+0900 0 1 periodic scrub scheduled @ 2024-01-22T09:54:51.278368+0900 331 0
> 14.3f 782 0 813 0 0 11598393034 0 0 1083 0 1083 active+recovery_wait+undersized+degraded+remapped 2024-01-19T18:06:41.845173+0900 15465'1083 15486:18496 [17,18,16,3,0,1,11,12,13] 17 [17,18,16,3,0,1,11,NONE,13] 17 14990'800 2024-01-15T16:42:08.037844+0900 14990'800 2024-01-15T16:42:08.037844+0900 0 40 periodic scrub scheduled @ 2024-01-23T10:44:06.083985+0900 563 0
>
> The first PG when I put the OSD back in:
>
> 14.3c 812 0 0 0 0 11925027758 0 0 1088 0 1088 active+clean 2024-01-19T18:07:18.079295+0900 15440'1088 15489:10792 [18,17,16,1,3,2,11,14,12] 18 [18,17,16,1,3,2,11,14,12] 18 14537'432 2024-01-12T11:25:54.168048+0900 0'0 2024-01-08T15:18:21.654679+0900 0 2 periodic scrub scheduled @ 2024-01-21T09:41:43.026836+0900 241 0
>
> As far as I know PGs are not supposed to actually become *degraded* when
> merely moving data around without any OSDs going down. Am I doing
> something wrong here? Any idea why this is affecting one pool and not
> both, even though they are almost identical in setup? It's as if, for
> this one pool, marking an OSD out has the effect of making its data
> unavailable entirely, instead of merely backfill to other OSDs (the OSD
> shows up as NONE in the above dump).
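
(For watching this happen on a single PG, something like the following is handy — again written from memory, so treat it as a sketch; 14.3d is just the example PG from the dumps above:)

# ceph pg map 14.3d
# ceph pg 14.3d query | jq '{state: .state, up: .up, acting: .acting}'

The first prints the current up/acting mapping for the PG; the second pulls the state and the up/acting sets out of the full pg query output (IIRC the missing shard shows up there as a huge placeholder OSD id rather than a literal NONE).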
>
> OSD tree:
>
> ID   CLASS  WEIGHT    TYPE NAME          STATUS  REWEIGHT  PRI-AFF
> -1          89.13765  root default
> -13         29.76414      host flamingo
>  11   hdd    7.27739          osd.11         up   1.00000  1.00000
>  12   hdd    7.27739          osd.12         up   1.00000  1.00000
>  13   hdd    7.27739          osd.13         up   1.00000  1.00000
>  14   hdd    7.20000          osd.14         up   1.00000  1.00000
>   8   ssd    0.73198          osd.8          up   1.00000  1.00000
> -10         29.84154      host heart
>   0   hdd    7.27739          osd.0          up   1.00000  1.00000
>   1   hdd    7.27739          osd.1          up   1.00000  1.00000
>   2   hdd    7.27739          osd.2          up   1.00000  1.00000
>   3   hdd    7.27739          osd.3          up   1.00000  1.00000
>   9   ssd    0.73198          osd.9          up   1.00000  1.00000
>  -3          0            host hub
>  -7         29.53197      host soleil
>  15   hdd    7.20000          osd.15         up         0  1.00000
>  16   hdd    7.20000          osd.16         up   1.00000  1.00000
>  17   hdd    7.20000          osd.17         up   1.00000  1.00000
>  18   hdd    7.20000          osd.18         up   1.00000  1.00000
>  10   ssd    0.73198          osd.10         up   1.00000  1.00000
>
> (I'm in the middle of doing some reprovisioning so 15 is out, this
> happens any time I take any OSD out)
>
> # ceph --version
> ceph version 18.2.1 (7fe91d5d5842e04be3b4f514d6dd990c54b29c76) reef (stable)
>
> - Hector
> _______________________________________________
> ceph-users mailing list -- ceph-users@xxxxxxx
> To unsubscribe send an email to ceph-users-leave@xxxxxxx
>
- Hector
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx