Re: 1 PG stuck in "active+undersized+degraded" for long time

I can provide some more details. These were the recovery steps taken so far; they started from here (I don't know the whole/exact story, though):

  70/868386704 objects unfound (0.000%)
  Reduced data availability: 8 pgs inactive, 8 pgs incomplete
  Possible data damage: 1 pg recovery_unfound
Degraded data redundancy: 45558/8766139136 objects degraded (0.001%), 2 pgs degraded, 1 pg undersized

By reducing min_size for the EC pools, some of the inactive PGs were cleaned up. For the remaining four incomplete PGs, they got further by marking the unfound objects as lost:

# ceph pg 15.f4f mark_unfound_lost delete
pg has 70 objects unfound and apparently lost marking
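For reference, the min_size reduction mentioned above would have looked roughly like this. The pool name is a placeholder, and the k=8, m=3 EC profile is only an assumption inferred from the 11-OSD acting set shown below (k+m=11); adjust to the actual profile:

```shell
# Check the pool's current setting (pool name is a placeholder)
ceph osd pool get <ec-pool> min_size

# Temporarily allow I/O with only k shards available (assuming k=8 here).
# Running at min_size == k leaves no redundancy margin, so this should
# only be done briefly to let the PGs recover.
ceph osd pool set <ec-pool> min_size 8

# Revert once recovery is done (the default for EC pools is k+1, i.e. 9 here)
ceph osd pool set <ec-pool> min_size 9
```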

And now one PG is stuck degraded:

# ceph pg ls degraded
PG      OBJECTS  DEGRADED  MISPLACED  UNFOUND  BYTES        OMAP_BYTES*  OMAP_KEYS*  LOG   STATE                       SINCE  VERSION        REPORTED        UP                                                       ACTING                                                   SCRUB_STAMP                 DEEP_SCRUB_STAMP
15.28f0 44994    44994     0          0        55288092914  0            0           3077  active+undersized+degraded  93s    310625'599302  310657:3603406  [2147483647,343,355,415,426,640,302,392,78,202,607]p343  [2147483647,343,355,415,426,640,302,392,78,202,607]p343  2021-04-11 03:18:39.164439  2021-04-10 01:42:16.182528

Setting osd.343 down didn't have any effect. Note the 2147483647 (CRUSH_ITEM_NONE) in the UP and ACTING sets: CRUSH failed to map an eleventh OSD for this PG. I therefore suggested increasing set_choose_tries from 100 to 150 in the respective crush_rule (I found a thread where that seemed to have helped), but don't have a response to that yet. If nothing else helps, would marking the PG's objects as unfound_lost (with data loss) help here?
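Raising the tries in a rule is usually done by editing a decompiled CRUSH map. A sketch using the standard crushtool workflow (the rule name and contents here are illustrative, not taken from this cluster):

```shell
# Export and decompile the current CRUSH map
ceph osd getcrushmap -o crushmap.bin
crushtool -d crushmap.bin -o crushmap.txt

# In crushmap.txt, inside the EC rule in question, add or raise the tries, e.g.:
#   rule my_ec_rule {
#       ...
#       step set_choose_tries 150
#       step take default
#       ...
#   }

# Recompile and inject the modified map
crushtool -c crushmap.txt -o crushmap-new.bin
ceph osd setcrushmap -i crushmap-new.bin
```

After injecting the new map, the affected PGs re-peer, which is when the extra tries can let CRUSH find the missing OSD.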

Zitat von Anthony D'Atri <anthony.datri@xxxxxxxxx>:

Sometimes one can even get away with "ceph osd down 343", which doesn't touch the OSD process itself. I have had occasions when this goosed peering in a less intrusive way. I believe it just marks the OSD down in the mons' map, and when that makes it to the OSD, the OSD responds with "I'm not dead yet" and gets marked up again.

On Jul 20, 2023, at 13:50, Matthew Leonard (BLOOMBERG/ 120 PARK) <mleonard33@xxxxxxxxxxxxx> wrote:

Assuming you're running systemd-managed OSDs, you can run the following command on the host that OSD 343 resides on.

systemctl restart ceph-osd@343

From: siddhit.renake@xxxxxxxxxx At: 07/20/23 13:44:36 UTC-4:00 To: ceph-users@xxxxxxx Subject: Re: 1 PG stuck in "active+undersized+degraded" for long time

What should be appropriate way to restart primary OSD in this case (343) ?
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx




