On 8/11/20 2:52 AM, Wido den Hollander wrote:
On 11/08/2020 00:40, Michael Thomas wrote:
On my relatively new Octopus cluster, I have one PG that has been
perpetually stuck in the 'unknown' state. It appears to belong to the
device_health_metrics pool, which was created automatically by the mgr
daemon(?).
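(For reference, the pool id is the part of the PG id before the dot, so pg 1.0 lives in pool 1; a listing like the following should confirm that pool 1 is device_health_metrics:)

# ceph osd lspools
# ceph osd pool ls detail | grep "^pool 1 "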
The OSDs that the PG maps to are all online and serving other PGs. But when I
list the PGs hosted on the OSDs reported by 'ceph pg map', the offending PG is
not among them.
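(The kind of listing I mean is roughly the following, with osd 41 taken from the 'ceph pg map' output below; pg 1.0 never shows up in it:)

# ceph pg ls-by-osd 41
# ceph pg ls-by-primary 41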
# ceph pg dump pgs | grep ^1.0
dumped pgs
1.0    0  0  0  0  0  0  0  0  0  0  unknown  2020-08-08T09:30:33.251653-0500  0'0  0:0  []  -1  []  -1  0'0  2020-08-08T09:30:33.251653-0500  0'0  2020-08-08T09:30:33.251653-0500  0
# ceph osd pool stats device_health_metrics
pool device_health_metrics id 1
nothing is going on
# ceph pg map 1.0
osdmap e7199 pg 1.0 (1.0) -> up [41,40,2] acting [41,0]
What can be done to fix the PG? I tried 'ceph pg repair 1.0', but it didn't
seem to do anything.
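(For completeness, the repair attempt was just the following; I assume a 'query' on the same PG would be the usual next step to inspect its peering state:)

# ceph pg repair 1.0
# ceph pg 1.0 query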
Is it safe to try to update the crush_rule for this pool so that the
PG gets mapped to a fresh set of OSDs?
Yes, it would be. But it's still weird, mainly because the acting set is so
different from the up set.
You have different CRUSH rules I think?
Marking those OSDs down might work, but otherwise change the crush_rule
and see how that goes.
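Something along these lines, for example, with the osd ids taken from your 'ceph pg map' output (marking them down just forces them to re-peer; they should come back up on their own):

# ceph osd down 41
# ceph osd down 0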
Yes, I do have different crush rules to help map certain types of data to
different classes of hardware (EC HDDs, replicated SSDs, replicated NVMe). The
default crush rule for the device_health_metrics pool used replication across
any storage device. I changed it to use the replicated NVMe crush rule, and
now the map looks different:
# ceph pg map 1.0
osdmap e7256 pg 1.0 (1.0) -> up [24,22,12] acting [41,0]
However, the acting set of OSDs has not changed.
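(For reference, the rule change I made was essentially the following; 'replicated_nvme' is just a stand-in for the actual rule name in my cluster:)

# ceph osd crush rule ls
# ceph osd pool set device_health_metrics crush_rule replicated_nvme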
--Mike