Re: Degraded PGs on EC pool when marking an OSD out

Hi,

This topic pops up every now and then, and although I don't have definitive proof for my assumptions, I still stand by them. ;-) As the docs [2] already state, it's expected that PGs become degraded after some sort of failure (setting an OSD "out" falls into that category, IMO):

It is normal for placement groups to enter “degraded” or “peering” states after a component failure. Normally, these states reflect the expected progression through the failure recovery process. However, a placement group that stays in one of these states for a long time might be an indication of a larger problem.

And you report that your PGs do not stay in that state but eventually recover. My understanding is as follows: PGs have to be recreated on different hosts/OSDs after setting an OSD "out". During this transition (peering) the PGs are degraded until the newly assigned OSDs have noticed their new responsibility (I'm not familiar with the actual data flow). The degraded state then clears, as long as the "out" OSD is still up (its PGs are still active). If you stop that OSD ("down"), the PGs become and stay degraded until they have been fully recreated on different hosts/OSDs. I'm not sure what determines how long it takes for the degraded state to clear; in my small test cluster (with an osd tree similar to yours) it clears after only a few seconds, but I also only have a few (almost empty) PGs in the EC test pool.
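
For anyone who wants to watch this on a test cluster, the rough sequence is something like the following (osd.14 just as an example):

# ceph osd out 14        # mark the OSD out, the daemon keeps running
# ceph pg ls degraded    # the degraded PGs right after the remap
# ceph -s                # the degraded object count should shrink and eventually clear
# ceph osd in 14         # undo

The interesting comparison is stopping the OSD daemon instead ("down"): then the degraded state only clears once recovery to the newly assigned OSDs has finished.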

I guess a comment from the devs couldn't hurt to clear this up.

[2] https://docs.ceph.com/en/reef/rados/troubleshooting/troubleshooting-pg/#stuck-placement-groups

Quoting Hector Martin <marcan@xxxxxxxxx>:

On 2024/01/22 19:06, Frank Schilder wrote:
You seem to have a problem with your crush rule(s):

14.3d ... [18,17,16,3,1,0,NONE,NONE,12]

If you really just took out 1 OSD, having 2xNONE in the acting set indicates that your crush rule can't find valid mappings. You might need to tune crush tunables: https://docs.ceph.com/en/reef/rados/troubleshooting/troubleshooting-pg/?highlight=crush%20gives%20up#troubleshooting-pgs
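
For completeness, the usual way to experiment with that, without touching the live map until you are sure, is roughly the following (file names are just examples):

# ceph osd getcrushmap -o crushmap.bin
# crushtool -d crushmap.bin -o crushmap.txt    # decompile, then edit e.g. set_choose_tries in the rule
# crushtool -c crushmap.txt -o crushmap.new
# crushtool -i crushmap.new --test --rule 7 --num-rep 9 --show-bad-mappings
# ceph osd setcrushmap -i crushmap.new         # only inject if the test reports no bad mappings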

Look closely: that's the *acting* (second column) OSD set, not the *up*
(first column) OSD set. It's supposed to be the *previous* set of OSDs
assigned to that PG, but inexplicably some OSDs just "fall off" when the
PGs get remapped around.

Simply waiting lets the data recover. At no point are any of my PGs
actually missing OSDs according to the current cluster state, and CRUSH
always finds a valid mapping. Rather, the problem is that the *previous*
set of OSDs just loses some entries for some reason.

The same problem happens when I *add* an OSD to the cluster. For
example, right now, osd.15 is out. This is the state of one pg:

14.3d       1044                   0         0          0        0
15730756731            0           0  1630         0      1630
active+clean  2024-01-22T20:15:46.684066+0900     15550'1630
15550:16184  [18,17,16,3,1,0,11,14,12]          18
[18,17,16,3,1,0,11,14,12]              18     15550'1629
2024-01-22T20:15:46.683491+0900              0'0
2024-01-08T15:18:21.654679+0900              0                    2
periodic scrub scheduled @ 2024-01-31T07:34:27.297723+0900
    1043                0

Note the OSD list ([18,17,16,3,1,0,11,14,12])

Then I bring osd.15 in and:

14.3d       1044                   0      1077          0        0
15730756731            0           0  1630         0      1630
active+recovery_wait+undersized+degraded+remapped
2024-01-22T22:52:22.700096+0900     15550'1630     15554:16163
[15,17,16,3,1,0,11,14,12]          15    [NONE,17,16,3,1,0,11,14,12]
         17     15550'1629  2024-01-22T20:15:46.683491+0900
0'0  2024-01-08T15:18:21.654679+0900              0                    2
 periodic scrub scheduled @ 2024-01-31T02:31:53.342289+0900
     1043                0

So somehow osd.18 "vanished" from the acting list
([NONE,17,16,3,1,0,11,14,12]) as it is being replaced by 15 in the new
up list ([15,17,16,3,1,0,11,14,12]). The data is in osd.18, but somehow
Ceph forgot.


It is possible that your low OSD count causes the "crush gives up too soon" issue. You might also consider using a crush rule that places exactly 3 shards per host (examples were in posts just last week). Otherwise it is not guaranteed that "... data remains available if a whole host goes down ...", because you might end up with 4 chunks on one host and fall below min_size (the failure domain of your crush rule for the EC profiles is OSD).

That should be what my CRUSH rule does: it picks 3 hosts, then picks 3
OSDs per host (IIUC). And oddly enough everything works for the other EC
pool, even though it shares the same CRUSH rule (that pool just ignores
one of the nine OSDs the rule emits).
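
If I have the math right, the host-failure case should also work out for the 5,4 pool: 3 shards per host means losing a whole host leaves 6 of 9 shards, which is exactly min_size, so the PGs stay active (with zero margin). The 5,3 pool only writes 8 shards, so a host holding 3 of them would leave 5, below min_size; that's the inactive case I mentioned for that pool. For reference:

# ceph osd pool get cephfs2_data_hec5.4 size
# ceph osd pool get cephfs2_data_hec5.4 min_size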

To test whether your crush rules can generate valid mappings, you can pull the osdmap of your cluster and use osdmaptool to experiment with it, without any risk of destroying anything. It allows you to try different crush rules and failure scenarios offline, but against real cluster metadata.
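
Something like this (I'm writing the flags from memory, the exact ones are in "osdmaptool --help"):

# ceph osd getmap -o /tmp/osdmap
# osdmaptool /tmp/osdmap --test-map-pgs-dump --pool 14             # dump the computed mappings per PG
# osdmaptool /tmp/osdmap --mark-out 14 --test-map-pgs --pool 14    # simulate osd.14 being out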

CRUSH steady state isn't the issue here, it's the dynamic state when
moving data that is the problem :)


Best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

________________________________________
From: Hector Martin <marcan@xxxxxxxxx>
Sent: Friday, January 19, 2024 10:12 AM
To: ceph-users@xxxxxxx
Subject:  Degraded PGs on EC pool when marking an OSD out

I'm having a bit of a weird issue with cluster rebalances with a new EC
pool. I have a 3-machine cluster, each machine with 4 HDD OSDs (+1 SSD).
Until now I've been using an erasure coded k=5 m=3 pool for most of my
data. I've recently started to migrate to a k=5 m=4 pool, so I can
configure the CRUSH rule to guarantee that data remains available if a
whole host goes down (3 chunks per host, 9 total). I also moved the 5,3
pool to this setup, although by nature I know its PGs will become
inactive if a host goes down (need at least k+1 OSDs to be up).

I've only just started migrating data to the 5,4 pool, but I've noticed
that any time I trigger any kind of backfilling (e.g. take one OSD out),
a bunch of PGs in the 5,4 pool become degraded (instead of just
misplaced/backfilling). This always seems to happen on that pool only,
and the object count is a significant fraction of the total pool object
count (it's not just "a few recently written objects while PGs were
repeering" or anything like that, I know about that effect).

Here are the pools:

pool 13 'cephfs2_data_hec5.3' erasure profile ec5.3 size 8 min_size 6
crush_rule 7 object_hash rjenkins pg_num 64 pgp_num 64 autoscale_mode
warn last_change 14133 lfor 0/11307/11305 flags
hashpspool,ec_overwrites,bulk stripe_width 20480 application cephfs
pool 14 'cephfs2_data_hec5.4' erasure profile ec5.4 size 9 min_size 6
crush_rule 7 object_hash rjenkins pg_num 64 pgp_num 64 autoscale_mode
warn last_change 14509 lfor 0/0/14234 flags
hashpspool,ec_overwrites,bulk stripe_width 20480 application cephfs

EC profiles:

# ceph osd erasure-code-profile get ec5.3
crush-device-class=
crush-failure-domain=osd
crush-root=default
jerasure-per-chunk-alignment=false
k=5
m=3
plugin=jerasure
technique=reed_sol_van
w=8

# ceph osd erasure-code-profile get ec5.4
crush-device-class=
crush-failure-domain=osd
crush-root=default
jerasure-per-chunk-alignment=false
k=5
m=4
plugin=jerasure
technique=reed_sol_van
w=8
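
For context, a profile like ec5.4 would have been created with something along these lines (the "get" output above is the authoritative bit):

# ceph osd erasure-code-profile set ec5.4 k=5 m=4 crush-failure-domain=osd crush-root=default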

They both use the same CRUSH rule, which is designed to select 9 OSDs
balanced across the hosts (of which only 8 slots get used for the older
5,3 pool):

rule hdd-ec-x3 {
        id 7
        type erasure
        step set_chooseleaf_tries 5
        step set_choose_tries 100
        step take default class hdd
        step choose indep 3 type host
        step choose indep 3 type osd
        step emit
}
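
A quick sanity check that this rule really emits 3 OSDs from each of 3 hosts (rule id 7 as above, a handful of sample inputs):

# ceph osd getcrushmap -o crushmap.bin
# crushtool -i crushmap.bin --test --rule 7 --num-rep 9 --show-mappings --min-x 0 --max-x 4
# crushtool -i crushmap.bin --test --rule 7 --num-rep 9 --show-bad-mappings    # prints nothing if every input maps fully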

If I take out an OSD (14), I get something like this:

    health: HEALTH_WARN
            Degraded data redundancy: 37631/120155160 objects degraded
(0.031%), 38 pgs degraded

All the degraded PGs are in the 5,4 pool, and the total object count is
around 50k, so this is *most* of the data in the pool becoming degraded
just because I marked an OSD out (without stopping it). If I mark the
OSD in again, the degraded state goes away.

Example degraded PGs:

# ceph pg dump | grep degraded
dumped all
14.3c        812                   0       838          0        0
11925027758            0           0  1088         0      1088
active+recovery_wait+undersized+degraded+remapped
2024-01-19T18:06:41.786745+0900     15440'1088     15486:10772
[18,17,16,1,3,2,11,13,12]          18    [18,17,16,1,3,2,11,NONE,12]
         18      14537'432  2024-01-12T11:25:54.168048+0900
0'0  2024-01-08T15:18:21.654679+0900              0                    2
 periodic scrub scheduled @ 2024-01-21T08:00:23.572904+0900
      241                0
14.3d        772                   0      1602          0        0
11303280223            0           0  1283         0      1283
active+recovery_wait+undersized+degraded+remapped
2024-01-19T18:06:41.919971+0900     15470'1283     15486:13384
[18,17,16,3,1,0,13,11,12]          18  [18,17,16,3,1,0,NONE,NONE,12]
         18      14990'771  2024-01-15T12:15:59.397469+0900
0'0  2024-01-08T15:18:21.654679+0900              0                    3
 periodic scrub scheduled @ 2024-01-23T15:56:58.912801+0900
      534                0
14.3e        806                   0       832          0        0
11843019697            0           0  1035         0      1035
active+recovery_wait+undersized+degraded+remapped
2024-01-19T18:06:42.297251+0900     15465'1035     15486:15423
[18,16,17,12,13,11,1,3,0]          18    [18,16,17,12,13,NONE,1,3,0]
         18      14623'500  2024-01-13T08:54:55.709717+0900
0'0  2024-01-08T15:18:21.654679+0900              0                    1
 periodic scrub scheduled @ 2024-01-22T09:54:51.278368+0900
      331                0
14.3f        782                   0       813          0        0
11598393034            0           0  1083         0      1083
active+recovery_wait+undersized+degraded+remapped
2024-01-19T18:06:41.845173+0900     15465'1083     15486:18496
[17,18,16,3,0,1,11,12,13]          17    [17,18,16,3,0,1,11,NONE,13]
         17      14990'800  2024-01-15T16:42:08.037844+0900
14990'800  2024-01-15T16:42:08.037844+0900              0
   40  periodic scrub scheduled @ 2024-01-23T10:44:06.083985+0900
            563                0

The first PG when I put the OSD back in:

14.3c        812                   0         0          0        0
11925027758            0           0  1088         0      1088
        active+clean  2024-01-19T18:07:18.079295+0900     15440'1088
15489:10792  [18,17,16,1,3,2,11,14,12]          18
[18,17,16,1,3,2,11,14,12]              18      14537'432
2024-01-12T11:25:54.168048+0900              0'0
2024-01-08T15:18:21.654679+0900              0                    2
periodic scrub scheduled @ 2024-01-21T09:41:43.026836+0900
     241                0

As far as I know PGs are not supposed to actually become *degraded* when
merely moving data around without any OSDs going down. Am I doing
something wrong here? Any idea why this is affecting one pool and not
both, even though they are almost identical in setup? It's as if, for
this one pool, marking an OSD out has the effect of making its data
unavailable entirely, instead of merely being backfilled to other OSDs
(the OSD shows up as NONE in the above dump).
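
I guess the next place to look is a pg query on one of the affected PGs, e.g.:

# ceph pg 14.3d query    # the "up", "acting" and "recovery_state" sections show what the primary thinks happened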

OSD tree:

ID   CLASS  WEIGHT    TYPE NAME          STATUS  REWEIGHT  PRI-AFF
 -1         89.13765  root default
-13         29.76414      host flamingo
 11    hdd   7.27739          osd.11         up   1.00000  1.00000
 12    hdd   7.27739          osd.12         up   1.00000  1.00000
 13    hdd   7.27739          osd.13         up   1.00000  1.00000
 14    hdd   7.20000          osd.14         up   1.00000  1.00000
  8    ssd   0.73198          osd.8          up   1.00000  1.00000
-10         29.84154      host heart
  0    hdd   7.27739          osd.0          up   1.00000  1.00000
  1    hdd   7.27739          osd.1          up   1.00000  1.00000
  2    hdd   7.27739          osd.2          up   1.00000  1.00000
  3    hdd   7.27739          osd.3          up   1.00000  1.00000
  9    ssd   0.73198          osd.9          up   1.00000  1.00000
 -3                0      host hub
 -7         29.53197      host soleil
 15    hdd   7.20000          osd.15         up         0  1.00000
 16    hdd   7.20000          osd.16         up   1.00000  1.00000
 17    hdd   7.20000          osd.17         up   1.00000  1.00000
 18    hdd   7.20000          osd.18         up   1.00000  1.00000
 10    ssd   0.73198          osd.10         up   1.00000  1.00000

(I'm in the middle of doing some reprovisioning, so osd.15 is out; this
happens any time I take any OSD out.)

# ceph --version
ceph version 18.2.1 (7fe91d5d5842e04be3b4f514d6dd990c54b29c76) reef (stable)

- Hector


- Hector


_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx



