Degraded PGs on EC pool when marking an OSD out

I'm having a bit of a weird issue with cluster rebalancing on a new EC
pool. I have a 3-machine cluster, each machine with 4 HDD OSDs (+1 SSD).
Until now I've been using an erasure coded k=5 m=3 pool for most of my
data. I've recently started to migrate to a k=5 m=4 pool, so I can
configure the CRUSH rule to guarantee that data remains available if a
whole host goes down (3 chunks per host, 9 total). I also moved the 5,3
pool to this setup, although by nature I know its PGs will become
inactive if a host goes down (need at least k+1 OSDs to be up).
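
(For context, the 5,4 setup was created roughly along these lines; this
is a sketch rather than my exact command history, but the parameters
match the profile and pool shown further down:)

# Sketch: k=5 m=4 profile matching the "ec5.4" dump below.
ceph osd erasure-code-profile set ec5.4 \
    k=5 m=4 \
    plugin=jerasure technique=reed_sol_van \
    crush-failure-domain=osd crush-root=default

# Pool created on that profile, pointed at the custom CRUSH rule
# (hdd-ec-x3, rule id 7) shown below, with EC overwrites for CephFS use.
ceph osd pool create cephfs2_data_hec5.4 64 64 erasure ec5.4 hdd-ec-x3
ceph osd pool set cephfs2_data_hec5.4 allow_ec_overwrites true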

I've only just started migrating data to the 5,4 pool, but I've noticed
that any time I trigger any kind of backfilling (e.g. taking one OSD
out), a bunch of PGs in the 5,4 pool become degraded (instead of just
misplaced/backfilling). This always seems to happen on that pool only,
and the degraded object count is a significant fraction of the pool's
total object count (it's not just "a few recently written objects while
PGs were re-peering" or anything like that; I know about that effect).
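
(Something like the following makes it easy to see which pool the
degraded PGs belong to; the pool ID is the prefix of the PG ID, so in
my case pool 14 is the 5,4 pool:)

# Degraded PGs in just the 5,4 pool:
ceph pg ls-by-pool cephfs2_data_hec5.4 degraded

# Or a quick overview of all PGs with their state and up/acting sets:
ceph pg dump pgs_brief | grep degraded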

Here are the pools:

pool 13 'cephfs2_data_hec5.3' erasure profile ec5.3 size 8 min_size 6 crush_rule 7
    object_hash rjenkins pg_num 64 pgp_num 64 autoscale_mode warn last_change 14133
    lfor 0/11307/11305 flags hashpspool,ec_overwrites,bulk stripe_width 20480
    application cephfs
pool 14 'cephfs2_data_hec5.4' erasure profile ec5.4 size 9 min_size 6 crush_rule 7
    object_hash rjenkins pg_num 64 pgp_num 64 autoscale_mode warn last_change 14509
    lfor 0/0/14234 flags hashpspool,ec_overwrites,bulk stripe_width 20480
    application cephfs

EC profiles:

# ceph osd erasure-code-profile get ec5.3
crush-device-class=
crush-failure-domain=osd
crush-root=default
jerasure-per-chunk-alignment=false
k=5
m=3
plugin=jerasure
technique=reed_sol_van
w=8

# ceph osd erasure-code-profile get ec5.4
crush-device-class=
crush-failure-domain=osd
crush-root=default
jerasure-per-chunk-alignment=false
k=5
m=4
plugin=jerasure
technique=reed_sol_van
w=8

They both use the same CRUSH rule, which is designed to select 9 OSDs
balanced across the hosts (of which only 8 slots get used for the older
5,3 pool):

rule hdd-ec-x3 {
        id 7
        type erasure
        step set_chooseleaf_tries 5
        step set_choose_tries 100
        step take default class hdd
        step choose indep 3 type host
        step choose indep 3 type osd
        step emit
}
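
(If it helps, the rule can also be exercised offline against the
compiled CRUSH map; a sketch, using rule ID 7 and 9 shards to match the
5,4 pool:)

# Grab and decompile the current CRUSH map.
ceph osd getcrushmap -o crushmap.bin
crushtool -d crushmap.bin -o crushmap.txt

# Test rule 7 with 9 shards; --show-bad-mappings flags any incomplete
# mappings the rule produces.
crushtool -i crushmap.bin --test --rule 7 --num-rep 9 \
    --show-mappings --show-bad-mappings

# Reweighting a device to 0 in the test roughly approximates marking
# that OSD out (osd.14 here).
crushtool -i crushmap.bin --test --rule 7 --num-rep 9 \
    --weight 14 0 --show-bad-mappings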

If I mark an OSD out (osd.14 here), I get something like this:

    health: HEALTH_WARN
            Degraded data redundancy: 37631/120155160 objects degraded (0.031%), 38 pgs degraded

All the degraded PGs are in the 5,4 pool, and that pool's total object
count is only around 50k, so this is *most* of the data in the pool
becoming degraded just because I marked an OSD out (without stopping
it). If I mark the OSD back in, the degraded state goes away.

Example degraded PGs:

# ceph pg dump | grep degraded
dumped all
14.3c  812  0  838  0  0  11925027758  0  0  1088  0  1088
    active+recovery_wait+undersized+degraded+remapped  2024-01-19T18:06:41.786745+0900
    15440'1088  15486:10772  [18,17,16,1,3,2,11,13,12]  18  [18,17,16,1,3,2,11,NONE,12]  18
    14537'432  2024-01-12T11:25:54.168048+0900  0'0  2024-01-08T15:18:21.654679+0900  0  2
    periodic scrub scheduled @ 2024-01-21T08:00:23.572904+0900  241  0

14.3d  772  0  1602  0  0  11303280223  0  0  1283  0  1283
    active+recovery_wait+undersized+degraded+remapped  2024-01-19T18:06:41.919971+0900
    15470'1283  15486:13384  [18,17,16,3,1,0,13,11,12]  18  [18,17,16,3,1,0,NONE,NONE,12]  18
    14990'771  2024-01-15T12:15:59.397469+0900  0'0  2024-01-08T15:18:21.654679+0900  0  3
    periodic scrub scheduled @ 2024-01-23T15:56:58.912801+0900  534  0

14.3e  806  0  832  0  0  11843019697  0  0  1035  0  1035
    active+recovery_wait+undersized+degraded+remapped  2024-01-19T18:06:42.297251+0900
    15465'1035  15486:15423  [18,16,17,12,13,11,1,3,0]  18  [18,16,17,12,13,NONE,1,3,0]  18
    14623'500  2024-01-13T08:54:55.709717+0900  0'0  2024-01-08T15:18:21.654679+0900  0  1
    periodic scrub scheduled @ 2024-01-22T09:54:51.278368+0900  331  0

14.3f  782  0  813  0  0  11598393034  0  0  1083  0  1083
    active+recovery_wait+undersized+degraded+remapped  2024-01-19T18:06:41.845173+0900
    15465'1083  15486:18496  [17,18,16,3,0,1,11,12,13]  17  [17,18,16,3,0,1,11,NONE,13]  17
    14990'800  2024-01-15T16:42:08.037844+0900  14990'800  2024-01-15T16:42:08.037844+0900  0  40
    periodic scrub scheduled @ 2024-01-23T10:44:06.083985+0900  563  0
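
(The full peering and recovery detail for any of these is available via
pg query, e.g. for the first one:)

# Shows the up/acting sets and the recovery_state section, including
# which shard is missing and why the PG is waiting for recovery.
ceph pg 14.3c query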

The first PG above, after I put the OSD back in:

14.3c  812  0  0  0  0  11925027758  0  0  1088  0  1088
    active+clean  2024-01-19T18:07:18.079295+0900
    15440'1088  15489:10792  [18,17,16,1,3,2,11,14,12]  18  [18,17,16,1,3,2,11,14,12]  18
    14537'432  2024-01-12T11:25:54.168048+0900  0'0  2024-01-08T15:18:21.654679+0900  0  2
    periodic scrub scheduled @ 2024-01-21T09:41:43.026836+0900  241  0

As far as I know, PGs are not supposed to actually become *degraded*
when data is merely being moved around and no OSD has gone down. Am I
doing something wrong here? Any idea why this affects one pool and not
the other, even though the two are almost identical in setup? It's as
if, for this one pool, marking an OSD out makes its data unavailable
entirely instead of merely backfilling it to other OSDs (the OSD's slot
shows up as NONE in the acting sets in the dump above).
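
(One way to separate what CRUSH wants from what the OSDs end up acting
on, sketched below, is to dump the pool's PG mappings from the osdmap
before and after marking the OSD out and diff them:)

# Capture the osdmap with osd.14 in and then out (this does kick off
# rebalancing, as described above).
ceph osd getmap -o osdmap.in
ceph osd out 14
ceph osd getmap -o osdmap.out
ceph osd in 14

# Dump per-PG raw/up/acting mappings for pool 14 from each map and
# compare them.
osdmaptool osdmap.in  --test-map-pgs-dump --pool 14 > mappings.in
osdmaptool osdmap.out --test-map-pgs-dump --pool 14 > mappings.out
diff mappings.in mappings.out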

OSD tree:

ID   CLASS  WEIGHT    TYPE NAME          STATUS  REWEIGHT  PRI-AFF
 -1         89.13765  root default
-13         29.76414      host flamingo
 11    hdd   7.27739          osd.11         up   1.00000  1.00000
 12    hdd   7.27739          osd.12         up   1.00000  1.00000
 13    hdd   7.27739          osd.13         up   1.00000  1.00000
 14    hdd   7.20000          osd.14         up   1.00000  1.00000
  8    ssd   0.73198          osd.8          up   1.00000  1.00000
-10         29.84154      host heart
  0    hdd   7.27739          osd.0          up   1.00000  1.00000
  1    hdd   7.27739          osd.1          up   1.00000  1.00000
  2    hdd   7.27739          osd.2          up   1.00000  1.00000
  3    hdd   7.27739          osd.3          up   1.00000  1.00000
  9    ssd   0.73198          osd.9          up   1.00000  1.00000
 -3                0      host hub
 -7         29.53197      host soleil
 15    hdd   7.20000          osd.15         up         0  1.00000
 16    hdd   7.20000          osd.16         up   1.00000  1.00000
 17    hdd   7.20000          osd.17         up   1.00000  1.00000
 18    hdd   7.20000          osd.18         up   1.00000  1.00000
 10    ssd   0.73198          osd.10         up   1.00000  1.00000

(I'm in the middle of some reprovisioning, so osd.15 is out; this
happens any time I take any OSD out.)

# ceph --version
ceph version 18.2.1 (7fe91d5d5842e04be3b4f514d6dd990c54b29c76) reef (stable)

- Hector


