Hi,
we replaced some of our OSDs a while ago, and while everything recovered
as planned, one PG is still stuck in active+clean+remapped with no
backfilling taking place.
Mapping the PG in question shows me that one OSD is missing:
$ ceph pg map 35.1fe
osdmap e1265760 pg 35.1fe (35.1fe) -> up
[97,190,65,23,393,223,2147483647,354,132] acting
[97,190,65,23,393,223,112,354,132]
It seems that osd.112 should be replaced with another OSD (2147483647 in
the up set is the value CRUSH reports when it cannot map an OSD to that
slot), and I suspect that CRUSH cannot find a suitable replacement.
Pool 35 is EC with k=7 and m=2, and our cluster has 9 OSD nodes. Is this
just a case of CRUSH giving up too early, as described in the
troubleshooting PGs section[0] of the docs? Running the test described
there with `crushtool` gives several bad mapping rule results for
"--num-rep 9".
If so, would it help to just add new OSDs to the existing hosts or would
it be better to add a whole new OSD host?
Are there other options (e.g. upmap) to force this single PG to use a
different set of OSDs for its "up" set?
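I was thinking of something along these lines, where osd.400 is a purely
hypothetical target on a host that does not already hold a shard of this
PG:

$ ceph osd pg-upmap-items 35.1fe 112 400

But I am not sure whether that works when the corresponding slot in the
up set is already 2147483647.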
[0]
https://github.com/ceph/ceph/blob/master/doc/rados/troubleshooting/troubleshooting-pg.rst#crush-gives-up-too-soon
Thanks,
Michael