On 08/03/2018 01:45 PM, Pawel S wrote:
> hello!
>
> We did maintenance work (cluster shrinking) on one cluster (jewel),
> and after shutting down one of the OSDs we hit a situation where
> recovery of a PG cannot start because it is stuck "querying" one of
> its peers. We restarted that OSD and tried marking it out and back
> in; nothing helped. Finally we moved the data off it (the PG was
> still on it) and removed the OSD from CRUSH and from the whole
> cluster. But recovery still won't start on any other OSD to recreate
> the copy. We still have two valid, active copies, but we would like
> the cluster to be clean.
> How can we push recovery to place this third copy somewhere?
> Replication size is 3 across hosts, and there are plenty of them.
>
> Status now:
>      health HEALTH_WARN
>             1 pgs degraded
>             1 pgs stuck degraded
>             1 pgs stuck unclean
>             1 pgs stuck undersized
>             1 pgs undersized
>             recovery 268/19265130 objects degraded (0.001%)
>
> Link to PG query details, health status and version commit here:
> https://gist.github.com/pejotes/aea71ecd2718dbb3ceab0e648924d06b

Can you add the output of 'ceph osd tree', 'ceph osd crush
show-tunables' and 'ceph osd crush rule dump'?

It looks like CRUSH is not able to find a place for the 3rd copy,
likely because of a big difference in the weights of your racks or
hosts, depending on your CRUSH rules.

--
PS
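
P.S. If you want to verify offline whether CRUSH can map three
replicas at all with your current map, you can replay the map through
crushtool. A minimal sketch, assuming your pool uses rule id 0 (check
'ceph osd crush rule dump' for the real id):

    # Pull the compiled CRUSH map out of the cluster and decompile it
    # so you can read the rules and weights
    ceph osd getcrushmap -o crushmap.bin
    crushtool -d crushmap.bin -o crushmap.txt

    # Simulate placements; every input that cannot be mapped to 3
    # distinct OSDs is reported as a bad mapping
    crushtool -i crushmap.bin --test --rule 0 --num-rep 3 --show-bad-mappings

If bad mappings show up with --num-rep 3 but not with 2, CRUSH is
giving up on the third replica (typically it runs out of retries
inside an underweighted rack/host); raising the choose_total_tries
tunable or evening out the weights should let recovery place the copy.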