stuck with active+undersized+degraded on Jewel after cluster maintenance

Pawel S <pejotes@xxxxxxxxx> · Fri, 3 Aug 2018 13:45:36 +0200

hello!

We did maintenance work (cluster shrinking) on one of our clusters (Jewel), and after shutting down one of the OSDs we ended up in a situation where recovery of a PG can't start because it is "querying" one of its peers. We restarted this OSD and tried marking it out and in again. Nothing helped, so finally we moved the data off it (the PG was still on it) and removed this OSD from CRUSH and from the whole cluster. But recovery still can't start on any other OSD to create this copy again. We still have 2 valid active copies, but we would like to have the PG clean.
How can we push recovery so that this third copy gets created somewhere? Replication size is 3 across hosts, and there are plenty of hosts.

Status now: 
   health HEALTH_WARN
            1 pgs degraded
            1 pgs stuck degraded
            1 pgs stuck unclean
            1 pgs stuck undersized
            1 pgs undersized
            recovery 268/19265130 objects degraded (0.001%)
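
To put the last line in proportion, here is a minimal Python sketch that pulls the degraded count out of the status text and recomputes the percentage. It is only a sanity check on the numbers pasted above, not part of any Ceph tooling; the function name degraded_ratio is made up for illustration, and the regular expression assumes the Jewel-style "recovery N/M objects degraded" wording shown in this message.

```python
import re
from typing import Tuple

# Status text copied from the message above (leading whitespace dropped).
STATUS = """\
health HEALTH_WARN
1 pgs degraded
1 pgs stuck degraded
1 pgs stuck unclean
1 pgs stuck undersized
1 pgs undersized
recovery 268/19265130 objects degraded (0.001%)
"""


def degraded_ratio(status: str) -> Tuple[int, int, float]:
    """Return (degraded, total, percent) parsed from the recovery line.

    Hypothetical helper: it only understands the
    'recovery N/M objects degraded' format seen in this post.
    """
    match = re.search(r"recovery (\d+)/(\d+) objects degraded", status)
    if match is None:
        raise ValueError("no 'objects degraded' line found")
    degraded = int(match.group(1))
    total = int(match.group(2))
    return degraded, total, 100.0 * degraded / total


degraded, total, percent = degraded_ratio(STATUS)
print(f"{degraded} of {total} objects degraded ({percent:.4f}%)")
```

So the degraded share is tiny and belongs to the single undersized PG; the question is only why its third copy is not being recreated.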

The PG query details, health status and version commit are here:
https://gist.github.com/pejotes/aea71ecd2718dbb3ceab0e648924d06b

best regards!
-- 
Pawel