Re: Degraded PGs on EC pool when marking an OSD out

On 2024/01/22 19:06, Frank Schilder wrote:
> You seem to have a problem with your crush rule(s):
> 
> 14.3d ... [18,17,16,3,1,0,NONE,NONE,12]
> 
> If you really just took out 1 OSD, having 2xNONE in the acting set indicates that your crush rule can't find valid mappings. You might need to tune crush tunables: https://docs.ceph.com/en/reef/rados/troubleshooting/troubleshooting-pg/?highlight=crush%20gives%20up#troubleshooting-pgs

Look closely: that's the *acting* (second column) OSD set, not the *up*
(first column) OSD set. It's supposed to be the *previous* set of OSDs
assigned to that PG, but inexplicably some OSDs just "fall off" of it
when the PGs get remapped.

Simply waiting lets the data recover. At no point are any of my PGs
actually missing OSDs according to the current cluster state, and CRUSH
always finds a valid mapping. Rather, the problem is that the *previous*
set of OSDs just loses some entries for some reason.
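
(If anyone wants to reproduce what I'm looking at, something like this
shows both sets for a single PG without grepping a full pg dump; the
second line is just the output shape from memory, so don't trust the
exact formatting:)

# ceph pg map 14.3d
osdmap eNNNNN pg 14.3d (14.3d) -> up [...] acting [...]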

The same problem happens when I *add* an OSD to the cluster. For
example, right now, osd.15 is out. This is the state of one pg:

14.3d       1044                   0         0          0        0
15730756731            0           0  1630         0      1630
active+clean  2024-01-22T20:15:46.684066+0900     15550'1630
15550:16184  [18,17,16,3,1,0,11,14,12]          18
[18,17,16,3,1,0,11,14,12]              18     15550'1629
2024-01-22T20:15:46.683491+0900              0'0
2024-01-08T15:18:21.654679+0900              0                    2
periodic scrub scheduled @ 2024-01-31T07:34:27.297723+0900
    1043                0

Note that the up and acting OSD lists match ([18,17,16,3,1,0,11,14,12]).

Then I bring osd.15 in and:

14.3d       1044                   0      1077          0        0
15730756731            0           0  1630         0      1630
active+recovery_wait+undersized+degraded+remapped
2024-01-22T22:52:22.700096+0900     15550'1630     15554:16163
[15,17,16,3,1,0,11,14,12]          15    [NONE,17,16,3,1,0,11,14,12]
         17     15550'1629  2024-01-22T20:15:46.683491+0900
0'0  2024-01-08T15:18:21.654679+0900              0                    2
 periodic scrub scheduled @ 2024-01-31T02:31:53.342289+0900
     1043                0

So osd.18 somehow "vanished" from the acting set
([NONE,17,16,3,1,0,11,14,12]) while it is being replaced by osd.15 in
the new up set ([15,17,16,3,1,0,11,14,12]). The data is still on osd.18,
but Ceph has apparently forgotten about it.
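
(The peering details can be pulled straight out of the PG itself; this
is roughly what I'd look at, assuming the usual `ceph pg query` JSON
layout, with jq only used for readability:)

# ceph pg 14.3d query > /tmp/pg.14.3d.json
# jq '.up, .acting' /tmp/pg.14.3d.json        <- the sets as the PG sees them
# jq '.recovery_state' /tmp/pg.14.3d.json     <- peering state, past intervals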

> 
> It is possible that your low OSD count causes the "crush gives up too soon" issue. You might also consider to use a crush rule that places exactly 3 shards per host (examples were in posts just last week). Otherwise, it is not guaranteed that "... data remains available if a whole host goes down ..." because you might have 4 chunks on one of the hosts and fall below min_size (the failure domain of your crush rule for the EC profiles is OSD).

That should be exactly what my CRUSH rule does: it picks 3 hosts, then
picks 3 OSDs per host (IIUC). And oddly enough everything works for the
other EC pool, even though it shares the same CRUSH rule (it just
ignores the ninth OSD the rule emits).
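
(For what it's worth, this is roughly how I'd check that placement
offline; crushtool options from memory, rule id 7 as shown in the rule
dump further down:)

# ceph osd getcrushmap -o /tmp/crushmap.bin
# crushtool -i /tmp/crushmap.bin --test --rule 7 --num-rep 9 --show-mappings
(every emitted mapping should contain exactly 3 OSDs from each host)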

> To test if your crush rules can generate valid mappings, you can pull the osdmap of your cluster and use osdmaptool to experiment with it without risk of destroying anything. It allows you to try different crush rules and failure scenarios on off-line but real cluster meta-data.

The CRUSH steady state isn't the issue here; it's the transient state
while data is being moved around that is the problem :)
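
(For completeness, the offline check you suggest would look something
like this, if I have the osdmaptool options right; pool 14 is the 5,4
pool:)

# ceph osd getmap -o /tmp/osdmap
# osdmaptool /tmp/osdmap --test-map-pgs --pool 14
# osdmaptool /tmp/osdmap --test-map-pg 14.3d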

> 
> Best regards,
> =================
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
> 
> ________________________________________
> From: Hector Martin <marcan@xxxxxxxxx>
> Sent: Friday, January 19, 2024 10:12 AM
> To: ceph-users@xxxxxxx
> Subject:  Degraded PGs on EC pool when marking an OSD out
> 
> I'm having a bit of a weird issue with cluster rebalances with a new EC
> pool. I have a 3-machine cluster, each machine with 4 HDD OSDs (+1 SSD).
> Until now I've been using an erasure coded k=5 m=3 pool for most of my
> data. I've recently started to migrate to a k=5 m=4 pool, so I can
> configure the CRUSH rule to guarantee that data remains available if a
> whole host goes down (3 chunks per host, 9 total). I also moved the 5,3
> pool to this setup, although by nature I know its PGs will become
> inactive if a host goes down (need at least k+1 OSDs to be up).
> 
> I've only just started migrating data to the 5,4 pool, but I've noticed
> that any time I trigger any kind of backfilling (e.g. take one OSD out),
> a bunch of PGs in the 5,4 pool become degraded (instead of just
> misplaced/backfilling). This always seems to happen on that pool only,
> and the object count is a significant fraction of the total pool object
> count (it's not just "a few recently written objects while PGs were
> repeering" or anything like that, I know about that effect).
> 
> Here are the pools:
> 
> pool 13 'cephfs2_data_hec5.3' erasure profile ec5.3 size 8 min_size 6
> crush_rule 7 object_hash rjenkins pg_num 64 pgp_num 64 autoscale_mode
> warn last_change 14133 lfor 0/11307/11305 flags
> hashpspool,ec_overwrites,bulk stripe_width 20480 application cephfs
> pool 14 'cephfs2_data_hec5.4' erasure profile ec5.4 size 9 min_size 6
> crush_rule 7 object_hash rjenkins pg_num 64 pgp_num 64 autoscale_mode
> warn last_change 14509 lfor 0/0/14234 flags
> hashpspool,ec_overwrites,bulk stripe_width 20480 application cephfs
> 
> EC profiles:
> 
> # ceph osd erasure-code-profile get ec5.3
> crush-device-class=
> crush-failure-domain=osd
> crush-root=default
> jerasure-per-chunk-alignment=false
> k=5
> m=3
> plugin=jerasure
> technique=reed_sol_van
> w=8
> 
> # ceph osd erasure-code-profile get ec5.4
> crush-device-class=
> crush-failure-domain=osd
> crush-root=default
> jerasure-per-chunk-alignment=false
> k=5
> m=4
> plugin=jerasure
> technique=reed_sol_van
> w=8
> 
> They both use the same CRUSH rule, which is designed to select 9 OSDs
> balanced across the hosts (of which only 8 slots get used for the older
> 5,3 pool):
> 
> rule hdd-ec-x3 {
>         id 7
>         type erasure
>         step set_chooseleaf_tries 5
>         step set_choose_tries 100
>         step take default class hdd
>         step choose indep 3 type host
>         step choose indep 3 type osd
>         step emit
> }
> 
> If I take out an OSD (14), I get something like this:
> 
>     health: HEALTH_WARN
>             Degraded data redundancy: 37631/120155160 objects degraded
> (0.031%), 38 pgs degraded
> 
> All the degraded PGs are in the 5,4 pool, and the total object count is
> around 50k, so this is *most* of the data in the pool becoming degraded
> just because I marked an OSD out (without stopping it). If I mark the
> OSD in again, the degraded state goes away.
> 
> Example degraded PGs:
> 
> # ceph pg dump | grep degraded
> dumped all
> 14.3c        812                   0       838          0        0
> 11925027758            0           0  1088         0      1088
> active+recovery_wait+undersized+degraded+remapped
> 2024-01-19T18:06:41.786745+0900     15440'1088     15486:10772
> [18,17,16,1,3,2,11,13,12]          18    [18,17,16,1,3,2,11,NONE,12]
>          18      14537'432  2024-01-12T11:25:54.168048+0900
> 0'0  2024-01-08T15:18:21.654679+0900              0                    2
>  periodic scrub scheduled @ 2024-01-21T08:00:23.572904+0900
>       241                0
> 14.3d        772                   0      1602          0        0
> 11303280223            0           0  1283         0      1283
> active+recovery_wait+undersized+degraded+remapped
> 2024-01-19T18:06:41.919971+0900     15470'1283     15486:13384
> [18,17,16,3,1,0,13,11,12]          18  [18,17,16,3,1,0,NONE,NONE,12]
>          18      14990'771  2024-01-15T12:15:59.397469+0900
> 0'0  2024-01-08T15:18:21.654679+0900              0                    3
>  periodic scrub scheduled @ 2024-01-23T15:56:58.912801+0900
>       534                0
> 14.3e        806                   0       832          0        0
> 11843019697            0           0  1035         0      1035
> active+recovery_wait+undersized+degraded+remapped
> 2024-01-19T18:06:42.297251+0900     15465'1035     15486:15423
> [18,16,17,12,13,11,1,3,0]          18    [18,16,17,12,13,NONE,1,3,0]
>          18      14623'500  2024-01-13T08:54:55.709717+0900
> 0'0  2024-01-08T15:18:21.654679+0900              0                    1
>  periodic scrub scheduled @ 2024-01-22T09:54:51.278368+0900
>       331                0
> 14.3f        782                   0       813          0        0
> 11598393034            0           0  1083         0      1083
> active+recovery_wait+undersized+degraded+remapped
> 2024-01-19T18:06:41.845173+0900     15465'1083     15486:18496
> [17,18,16,3,0,1,11,12,13]          17    [17,18,16,3,0,1,11,NONE,13]
>          17      14990'800  2024-01-15T16:42:08.037844+0900
> 14990'800  2024-01-15T16:42:08.037844+0900              0
>    40  periodic scrub scheduled @ 2024-01-23T10:44:06.083985+0900
>             563                0
> 
> The first PG when I put the OSD back in:
> 
> 14.3c        812                   0         0          0        0
> 11925027758            0           0  1088         0      1088
>         active+clean  2024-01-19T18:07:18.079295+0900     15440'1088
> 15489:10792  [18,17,16,1,3,2,11,14,12]          18
> [18,17,16,1,3,2,11,14,12]              18      14537'432
> 2024-01-12T11:25:54.168048+0900              0'0
> 2024-01-08T15:18:21.654679+0900              0                    2
> periodic scrub scheduled @ 2024-01-21T09:41:43.026836+0900
>      241                0
> 
> As far as I know PGs are not supposed to actually become *degraded* when
> merely moving data around without any OSDs going down. Am I doing
> something wrong here? Any idea why this is affecting one pool and not
> both, even though they are almost identical in setup? It's as if, for
> this one pool, marking an OSD out has the effect of making its data
> unavailable entirely, instead of merely backfilling it to other OSDs (the OSD
> shows up as NONE in the above dump).
> 
> OSD tree:
> 
> ID   CLASS  WEIGHT    TYPE NAME          STATUS  REWEIGHT  PRI-AFF
>  -1         89.13765  root default
> -13         29.76414      host flamingo
>  11    hdd   7.27739          osd.11         up   1.00000  1.00000
>  12    hdd   7.27739          osd.12         up   1.00000  1.00000
>  13    hdd   7.27739          osd.13         up   1.00000  1.00000
>  14    hdd   7.20000          osd.14         up   1.00000  1.00000
>   8    ssd   0.73198          osd.8          up   1.00000  1.00000
> -10         29.84154      host heart
>   0    hdd   7.27739          osd.0          up   1.00000  1.00000
>   1    hdd   7.27739          osd.1          up   1.00000  1.00000
>   2    hdd   7.27739          osd.2          up   1.00000  1.00000
>   3    hdd   7.27739          osd.3          up   1.00000  1.00000
>   9    ssd   0.73198          osd.9          up   1.00000  1.00000
>  -3                0      host hub
>  -7         29.53197      host soleil
>  15    hdd   7.20000          osd.15         up         0  1.00000
>  16    hdd   7.20000          osd.16         up   1.00000  1.00000
>  17    hdd   7.20000          osd.17         up   1.00000  1.00000
>  18    hdd   7.20000          osd.18         up   1.00000  1.00000
>  10    ssd   0.73198          osd.10         up   1.00000  1.00000
> 
> (I'm in the middle of doing some reprovisioning so 15 is out, this
> happens any time I take any OSD out)
> 
> # ceph --version
> ceph version 18.2.1 (7fe91d5d5842e04be3b4f514d6dd990c54b29c76) reef (stable)
> 
> - Hector
> _______________________________________________
> ceph-users mailing list -- ceph-users@xxxxxxx
> To unsubscribe send an email to ceph-users-leave@xxxxxxx
> 

- Hector
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx
