Hi guys,

I am in the middle of removing an OSD from my rook-ceph cluster. I ran 'ceph osd out osd.7' and the rebalancing process started. Now it has stalled with one PG stuck in "active+undersized+degraded". I have done this before and it worked fine.

# ceph health detail
HEALTH_WARN Degraded data redundancy: 15/94659 objects degraded (0.016%), 1 pg degraded, 1 pg undersized
[WRN] PG_DEGRADED: Degraded data redundancy: 15/94659 objects degraded (0.016%), 1 pg degraded, 1 pg undersized
    pg 3.1f is stuck undersized for 2h, current state active+undersized+degraded, last acting [0,2]

# ceph pg dump_stuck
PG_STAT  STATE                       UP     UP_PRIMARY  ACTING  ACTING_PRIMARY
3.1f     active+undersized+degraded  [0,2]  0           [0,2]   0

I have lots of OSDs on different nodes:

# ceph osd tree
ID   CLASS  WEIGHT    TYPE NAME                            STATUS  REWEIGHT  PRI-AFF
 -1         13.77573  root default
 -5         13.77573      region FSN1
-22          0.73419          zone FSN1-DC13
-21                0              host node5-redacted-com
-27          0.73419              host node7-redacted-com
  1    ssd   0.36710                  osd.1                    up   1.00000  1.00000
  5    ssd   0.36710                  osd.5                    up   1.00000  1.00000
-10          6.20297          zone FSN1-DC14
 -9          6.20297              host node3-redacted-com
  2    ssd   3.10149                  osd.2                    up   1.00000  1.00000
  4    ssd   3.10149                  osd.4                    up   1.00000  1.00000
-18          3.19919          zone FSN1-DC15
-17          3.19919              host node4-redacted-com
  7    ssd   3.19919                  osd.7                  down         0  1.00000
 -4          2.90518          zone FSN1-DC16
 -3          2.90518              host node1-redacted-com
  0    ssd   1.45259                  osd.0                    up   1.00000  1.00000
  3    ssd   1.45259                  osd.3                    up   1.00000  1.00000
-14          0.73419          zone FSN1-DC18
-13                0              host node2-redacted-com
-25          0.73419              host node6-redacted-com
 10    ssd   0.36710                  osd.10                   up   1.00000  1.00000
 11    ssd   0.36710                  osd.11                   up   1.00000  1.00000

Any ideas on how to fix this?

Thanks
David
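P.S. For completeness, this is the rough checklist of diagnostics I was planning to run next (3.1f is the stuck PG from the output above; I haven't gone through all of them yet), in case the output from any of these would help:

# ceph osd pool ls detail
    (size/min_size of each pool and which crush_rule it uses)
# ceph osd crush rule dump
    (the CRUSH rules, including the failure-domain step for the pool's rule)
# ceph pg 3.1f query
    (detailed peering/recovery state for the stuck PG)
# ceph osd df tree
    (per-zone and per-OSD weights and utilisation)

Happy to post any of that output if it helps.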