Re: Degraded PGs on EC pool when marking an OSD out

Hi,

Hector also claims that he observed an incomplete acting set after *adding* an OSD. Assuming that the cluster was HEALTH_OK before that, this should not happen in theory. In practice it has been observed with certain crush map definitions. There is, for example, the issue of "choose" and "chooseleaf" not doing the same thing in situations where they should. Another one was that spurious (temporary) allocations of PGs could exceed hard limits without being obvious or reported at all. Without seeing the crush maps it's hard to tell what is going on. With just 3 hosts and 4 OSDs per host, the cluster might be hitting corner cases with such a wide EC profile.

Having the osdmap of the cluster in normal conditions would make it possible to simulate OSD downs and ups off-line, and one might gain insight into why crush fails to compute a complete acting set (yes, I'm not talking about the up set, I was always talking about the acting set). There might also be an issue with the PG-/OSD-map logs that track the full history of the PGs in question.
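
As a rough sketch (not verified against this cluster; the option names are from recent releases, so double-check them against osdmaptool --help and crushtool --help for your version; pool 14 and osd.15 are just taken from the examples further down in this thread), the off-line simulation could look like this:

# ceph osd getmap -o /tmp/osdmap
# osdmaptool /tmp/osdmap --test-map-pgs-dump --pool 14
# osdmaptool /tmp/osdmap --mark-out 15 --test-map-pgs-dump --pool 14
# osdmaptool /tmp/osdmap --export-crush /tmp/crushmap
# crushtool -i /tmp/crushmap --test --rule 7 --num-rep 9 --show-bad-mappings

The first osdmaptool run shows the computed mappings for the healthy state, the second re-computes them with osd.15 marked out (nothing is written back to the cluster), and the crushtool run reports any inputs for which rule 7 cannot produce 9 valid mappings.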

A possible way to test is to issue a re-peer command, after all peering has finished, on a PG with an incomplete acting set, and see whether this resolves the PG. If it does, there is a temporary condition that prevents the PGs from becoming clean when going through the standard peering procedure.
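
For example, just a sketch using PG 14.3d from Hector's output (any PG with a NONE in its acting set will do):

# ceph pg 14.3d query > /tmp/pg-before.json
# ceph pg repeer 14.3d
# ceph pg 14.3d query > /tmp/pg-after.json

Comparing the "up" and "acting" arrays in the two query dumps should show whether the forced re-peer fills in the missing acting-set entries.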

Best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

________________________________________
From: Eugen Block <eblock@xxxxxx>
Sent: Wednesday, January 24, 2024 9:45 AM
To: ceph-users@xxxxxxx
Subject:  Re: Degraded PGs on EC pool when marking an OSD out

Hi,

this topic pops up every now and then, and although I don't have
definitive proof for my assumptions, I still stand by them. ;-)
As the docs [2] already state, it's expected that PGs become degraded
after some sort of failure (setting an OSD "out" falls into that
category IMO):

> It is normal for placement groups to enter “degraded” or “peering”
> states after a component failure. Normally, these states reflect the
> expected progression through the failure recovery process. However,
> a placement group that stays in one of these states for a long time
> might be an indication of a larger problem.

And you report that your PGs do not stay in that state but eventually
recover. My understanding is as follows:
PGs have to be recreated on different hosts/OSDs after setting an OSD
"out". During this transition (peering) the PGs are degraded until the
newly assigned OSDs have noticed their new responsibility (I'm not
familiar with the actual data flow). The degraded state then clears as
long as the out OSD is still up (its PGs are still active). If you stop
that OSD ("down"), the PGs become and stay degraded until they have
been fully recreated on different hosts/OSDs. I'm not sure what
determines how long it takes for the degraded state to clear, but in my
small test cluster (a similar osd tree to yours) it clears after only a
few seconds; then again, I only have a few (almost empty) PGs in the EC
test pool.
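
If someone wants to reproduce this on a test cluster, something along
these lines should make the transition visible (just a sketch, osd.14
is only an example, and ceph pg ls needs a version that supports
filtering by state):

# ceph osd out 14
# watch -n 2 'ceph pg ls degraded'
# ceph osd in 14

Degraded PGs should show up right after the "out" and clear again once
peering has settled, long before backfill is done.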

I guess a comment from the devs couldn't hurt to clear this up.

[2]
https://docs.ceph.com/en/reef/rados/troubleshooting/troubleshooting-pg/#stuck-placement-groups

Zitat von Hector Martin <marcan@xxxxxxxxx>:

> On 2024/01/22 19:06, Frank Schilder wrote:
>> You seem to have a problem with your crush rule(s):
>>
>> 14.3d ... [18,17,16,3,1,0,NONE,NONE,12]
>>
>> If you really just took out 1 OSD, having 2xNONE in the acting set
>> indicates that your crush rule can't find valid mappings. You might
>> need to tune crush tunables:
>> https://docs.ceph.com/en/reef/rados/troubleshooting/troubleshooting-pg/?highlight=crush%20gives%20up#troubleshooting-pgs
>
> Look closely: that's the *acting* (second column) OSD set, not the *up*
> (first column) OSD set. It's supposed to be the *previous* set of OSDs
> assigned to that PG, but inexplicably some OSDs just "fall off" when the
> PGs get remapped around.
>
> Simply waiting lets the data recover. At no point are any of my PGs
> actually missing OSDs according to the current cluster state, and CRUSH
> always finds a valid mapping. Rather the problem is that the *previous*
> set of OSDs just loses some entries for some reason.
>
> The same problem happens when I *add* an OSD to the cluster. For
> example, right now, osd.15 is out. This is the state of one pg:
>
> 14.3d       1044                   0         0          0        0
> 15730756731            0           0  1630         0      1630
> active+clean  2024-01-22T20:15:46.684066+0900     15550'1630
> 15550:16184  [18,17,16,3,1,0,11,14,12]          18
> [18,17,16,3,1,0,11,14,12]              18     15550'1629
> 2024-01-22T20:15:46.683491+0900              0'0
> 2024-01-08T15:18:21.654679+0900              0                    2
> periodic scrub scheduled @ 2024-01-31T07:34:27.297723+0900
>     1043                0
>
> Note the OSD list ([18,17,16,3,1,0,11,14,12])
>
> Then I bring osd.15 in and:
>
> 14.3d       1044                   0      1077          0        0
> 15730756731            0           0  1630         0      1630
> active+recovery_wait+undersized+degraded+remapped
> 2024-01-22T22:52:22.700096+0900     15550'1630     15554:16163
> [15,17,16,3,1,0,11,14,12]          15    [NONE,17,16,3,1,0,11,14,12]
>          17     15550'1629  2024-01-22T20:15:46.683491+0900
> 0'0  2024-01-08T15:18:21.654679+0900              0                    2
>  periodic scrub scheduled @ 2024-01-31T02:31:53.342289+0900
>      1043                0
>
> So somehow osd.18 "vanished" from the acting list
> ([NONE,17,16,3,1,0,11,14,12]) as it is being replaced by 15 in the new
> up list ([15,17,16,3,1,0,11,14,12]). The data is in osd.18, but somehow
> Ceph forgot.
>
>>
>> It is possible that your low OSD count causes the "crush gives up
>> too soon" issue. You might also consider using a crush rule that
>> places exactly 3 shards per host (examples were in posts just last
>> week). Otherwise, it is not guaranteed that "... data remains
>> available if a whole host goes down ..." because you might have 4
>> chunks on one of the hosts and fall below min_size (the failure
>> domain of your crush rule for the EC profiles is OSD).
>
> That should be what my CRUSH rule does. It picks 3 hosts then picks 3
> OSDs per host (IIUC). And oddly enough everything works for the other EC
> pool even though it shares the same CRUSH rule (just ignoring one OSD
> from it).
>
>> To test if your crush rules can generate valid mappings, you can
>> pull the osdmap of your cluster and use osdmaptool to experiment
>> with it without risk of destroying anything. It allows you to try
>> different crush rules and failure scenarios on off-line but real
>> cluster meta-data.
>
> CRUSH steady state isn't the issue here, it's the dynamic state when
> moving data that is the problem :)
>
>>
>> Best regards,
>> =================
>> Frank Schilder
>> AIT Risø Campus
>> Bygning 109, rum S14
>>
>> ________________________________________
>> From: Hector Martin <marcan@xxxxxxxxx>
>> Sent: Friday, January 19, 2024 10:12 AM
>> To: ceph-users@xxxxxxx
>> Subject:  Degraded PGs on EC pool when marking an OSD out
>>
>> I'm having a bit of a weird issue with cluster rebalances with a new EC
>> pool. I have a 3-machine cluster, each machine with 4 HDD OSDs (+1 SSD).
>> Until now I've been using an erasure coded k=5 m=3 pool for most of my
>> data. I've recently started to migrate to a k=5 m=4 pool, so I can
>> configure the CRUSH rule to guarantee that data remains available if a
>> whole host goes down (3 chunks per host, 9 total). I also moved the 5,3
>> pool to this setup, although by nature I know its PGs will become
>> inactive if a host goes down (need at least k+1 OSDs to be up).
>>
>> I've only just started migrating data to the 5,4 pool, but I've noticed
>> that any time I trigger any kind of backfilling (e.g. take one OSD out),
>> a bunch of PGs in the 5,4 pool become degraded (instead of just
>> misplaced/backfilling). This always seems to happen on that pool only,
>> and the object count is a significant fraction of the total pool object
>> count (it's not just "a few recently written objects while PGs were
>> repeering" or anything like that, I know about that effect).
>>
>> Here are the pools:
>>
>> pool 13 'cephfs2_data_hec5.3' erasure profile ec5.3 size 8 min_size 6
>> crush_rule 7 object_hash rjenkins pg_num 64 pgp_num 64 autoscale_mode
>> warn last_change 14133 lfor 0/11307/11305 flags
>> hashpspool,ec_overwrites,bulk stripe_width 20480 application cephfs
>> pool 14 'cephfs2_data_hec5.4' erasure profile ec5.4 size 9 min_size 6
>> crush_rule 7 object_hash rjenkins pg_num 64 pgp_num 64 autoscale_mode
>> warn last_change 14509 lfor 0/0/14234 flags
>> hashpspool,ec_overwrites,bulk stripe_width 20480 application cephfs
>>
>> EC profiles:
>>
>> # ceph osd erasure-code-profile get ec5.3
>> crush-device-class=
>> crush-failure-domain=osd
>> crush-root=default
>> jerasure-per-chunk-alignment=false
>> k=5
>> m=3
>> plugin=jerasure
>> technique=reed_sol_van
>> w=8
>>
>> # ceph osd erasure-code-profile get ec5.4
>> crush-device-class=
>> crush-failure-domain=osd
>> crush-root=default
>> jerasure-per-chunk-alignment=false
>> k=5
>> m=4
>> plugin=jerasure
>> technique=reed_sol_van
>> w=8
>>
>> They both use the same CRUSH rule, which is designed to select 9 OSDs
>> balanced across the hosts (of which only 8 slots get used for the older
>> 5,3 pool):
>>
>> rule hdd-ec-x3 {
>>         id 7
>>         type erasure
>>         step set_chooseleaf_tries 5
>>         step set_choose_tries 100
>>         step take default class hdd
>>         step choose indep 3 type host
>>         step choose indep 3 type osd
>>         step emit
>> }
>>
>> If I take out an OSD (14), I get something like this:
>>
>>     health: HEALTH_WARN
>>             Degraded data redundancy: 37631/120155160 objects degraded
>> (0.031%), 38 pgs degraded
>>
>> All the degraded PGs are in the 5,4 pool, and the total object count is
>> around 50k, so this is *most* of the data in the pool becoming degraded
>> just because I marked an OSD out (without stopping it). If I mark the
>> OSD in again, the degraded state goes away.
>>
>> Example degraded PGs:
>>
>> # ceph pg dump | grep degraded
>> dumped all
>> 14.3c        812                   0       838          0        0
>> 11925027758            0           0  1088         0      1088
>> active+recovery_wait+undersized+degraded+remapped
>> 2024-01-19T18:06:41.786745+0900     15440'1088     15486:10772
>> [18,17,16,1,3,2,11,13,12]          18    [18,17,16,1,3,2,11,NONE,12]
>>          18      14537'432  2024-01-12T11:25:54.168048+0900
>> 0'0  2024-01-08T15:18:21.654679+0900              0                    2
>>  periodic scrub scheduled @ 2024-01-21T08:00:23.572904+0900
>>       241                0
>> 14.3d        772                   0      1602          0        0
>> 11303280223            0           0  1283         0      1283
>> active+recovery_wait+undersized+degraded+remapped
>> 2024-01-19T18:06:41.919971+0900     15470'1283     15486:13384
>> [18,17,16,3,1,0,13,11,12]          18  [18,17,16,3,1,0,NONE,NONE,12]
>>          18      14990'771  2024-01-15T12:15:59.397469+0900
>> 0'0  2024-01-08T15:18:21.654679+0900              0                    3
>>  periodic scrub scheduled @ 2024-01-23T15:56:58.912801+0900
>>       534                0
>> 14.3e        806                   0       832          0        0
>> 11843019697            0           0  1035         0      1035
>> active+recovery_wait+undersized+degraded+remapped
>> 2024-01-19T18:06:42.297251+0900     15465'1035     15486:15423
>> [18,16,17,12,13,11,1,3,0]          18    [18,16,17,12,13,NONE,1,3,0]
>>          18      14623'500  2024-01-13T08:54:55.709717+0900
>> 0'0  2024-01-08T15:18:21.654679+0900              0                    1
>>  periodic scrub scheduled @ 2024-01-22T09:54:51.278368+0900
>>       331                0
>> 14.3f        782                   0       813          0        0
>> 11598393034            0           0  1083         0      1083
>> active+recovery_wait+undersized+degraded+remapped
>> 2024-01-19T18:06:41.845173+0900     15465'1083     15486:18496
>> [17,18,16,3,0,1,11,12,13]          17    [17,18,16,3,0,1,11,NONE,13]
>>          17      14990'800  2024-01-15T16:42:08.037844+0900
>> 14990'800  2024-01-15T16:42:08.037844+0900              0
>>    40  periodic scrub scheduled @ 2024-01-23T10:44:06.083985+0900
>>             563                0
>>
>> The first PG when I put the OSD back in:
>>
>> 14.3c        812                   0         0          0        0
>> 11925027758            0           0  1088         0      1088
>>         active+clean  2024-01-19T18:07:18.079295+0900     15440'1088
>> 15489:10792  [18,17,16,1,3,2,11,14,12]          18
>> [18,17,16,1,3,2,11,14,12]              18      14537'432
>> 2024-01-12T11:25:54.168048+0900              0'0
>> 2024-01-08T15:18:21.654679+0900              0                    2
>> periodic scrub scheduled @ 2024-01-21T09:41:43.026836+0900
>>      241                0
>>
>> As far as I know PGs are not supposed to actually become *degraded* when
>> merely moving data around without any OSDs going down. Am I doing
>> something wrong here? Any idea why this is affecting one pool and not
>> both, even though they are almost identical in setup? It's as if, for
>> this one pool, marking an OSD out has the effect of making its data
>> unavailable entirely, instead of merely backfilling to other OSDs (the OSD
>> shows up as NONE in the above dump).
>>
>> OSD tree:
>>
>> ID   CLASS  WEIGHT    TYPE NAME          STATUS  REWEIGHT  PRI-AFF
>>  -1         89.13765  root default
>> -13         29.76414      host flamingo
>>  11    hdd   7.27739          osd.11         up   1.00000  1.00000
>>  12    hdd   7.27739          osd.12         up   1.00000  1.00000
>>  13    hdd   7.27739          osd.13         up   1.00000  1.00000
>>  14    hdd   7.20000          osd.14         up   1.00000  1.00000
>>   8    ssd   0.73198          osd.8          up   1.00000  1.00000
>> -10         29.84154      host heart
>>   0    hdd   7.27739          osd.0          up   1.00000  1.00000
>>   1    hdd   7.27739          osd.1          up   1.00000  1.00000
>>   2    hdd   7.27739          osd.2          up   1.00000  1.00000
>>   3    hdd   7.27739          osd.3          up   1.00000  1.00000
>>   9    ssd   0.73198          osd.9          up   1.00000  1.00000
>>  -3                0      host hub
>>  -7         29.53197      host soleil
>>  15    hdd   7.20000          osd.15         up         0  1.00000
>>  16    hdd   7.20000          osd.16         up   1.00000  1.00000
>>  17    hdd   7.20000          osd.17         up   1.00000  1.00000
>>  18    hdd   7.20000          osd.18         up   1.00000  1.00000
>>  10    ssd   0.73198          osd.10         up   1.00000  1.00000
>>
>> (I'm in the middle of doing some reprovisioning, so 15 is out; this
>> happens any time I take any OSD out)
>>
>> # ceph --version
>> ceph version 18.2.1 (7fe91d5d5842e04be3b4f514d6dd990c54b29c76) reef (stable)
>>
>> - Hector
>> _______________________________________________
>> ceph-users mailing list -- ceph-users@xxxxxxx
>> To unsubscribe send an email to ceph-users-leave@xxxxxxx
>>
>
> - Hector
> _______________________________________________
> ceph-users mailing list -- ceph-users@xxxxxxx
> To unsubscribe send an email to ceph-users-leave@xxxxxxx


_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx



