On 11/08/2020 00:40, Michael Thomas wrote:
On my relatively new Octopus cluster, I have one PG that has been
perpetually stuck in the 'unknown' state. It appears to belong to the
device_health_metrics pool, which was created automatically by the mgr
daemon(?).
The OSDs that the PG maps to are all online and serving other PGs. But
when I list the PGs belonging to the OSDs reported by 'ceph pg map', the
offending PG is not among them.
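A quick cross-check (a sketch only; osd.41 is one of the OSDs from the
mapping further down) is to list the PGs that OSD actually carries, which
also comes back empty for 1.0 here:

# ceph pg ls-by-osd osd.41 | grep ^1.0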
# ceph pg dump pgs | grep ^1.0
dumped pgs
1.0  0  0  0  0  0  0  0  0  0  0  unknown  2020-08-08T09:30:33.251653-0500  0'0  0:0  []  -1  []  -1  0'0  2020-08-08T09:30:33.251653-0500  0'0  2020-08-08T09:30:33.251653-0500  0
# ceph osd pool stats device_health_metrics
pool device_health_metrics id 1
nothing is going on
# ceph pg map 1.0
osdmap e7199 pg 1.0 (1.0) -> up [41,40,2] acting [41,0]
What can be done to fix the PG? I tried doing a 'ceph pg repair 1.0',
but that didn't seem to do anything.
Is it safe to try to update the crush_rule for this pool so that the PG
gets mapped to a fresh set of OSDs?
Yes, that would be safe. Still, it's weird, mainly because the acting set
is so different from the up set. Do you have different CRUSH rules,
perhaps?
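To check, you could compare what the pool is set to with what that rule
actually selects, something like this (a sketch; substitute the rule name
the first command reports):

# ceph osd pool get device_health_metrics crush_rule
# ceph osd crush rule dump <rule-name>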
Marking those OSDs down might work, but otherwise change the crush_rule
and see how that goes.
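Something along these lines might do it (a sketch only; 41 and 0 are the
OSDs from your acting set, and <rule-name> stands for whichever rule you
want the pool to use). Marking the OSDs down makes them re-register and
forces the PG to re-peer:

# ceph osd down 41 0

Otherwise, point the pool at a different rule:

# ceph osd pool set device_health_metrics crush_rule <rule-name>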
Wido
--Mike
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx