How to just delete PGs stuck incomplete on EC pool

56-OSD, 6-node Ceph 12.2.5 (Luminous) cluster on Proxmox

We had multiple drives fail (about 30% of them) within a few days of each other, likely faster than the cluster could recover.

After the dust settled, we have 2 out of 896 pgs stuck inactive. The failed drives are completely inaccessible, so I can't mount them and attempt to export the PGs.
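
For reference, had the failed drives been readable, my plan was to move the missing shards with ceph-objectstore-tool, roughly like this (OSD IDs and paths are placeholders; for EC pools the shard is part of the PG ID, e.g. 18.cs0):

# on the dead OSD's host, with that OSD stopped, export the PG shard
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-<dead-id> \
    --pgid 18.cs0 --op export --file /root/18.cs0.export

# import it into a healthy OSD (also stopped), then start that OSD again
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-<good-id> \
    --op import --file /root/18.cs0.export
systemctl start ceph-osd@<good-id>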

Do I have any options besides just considering them lost? And if not, how do I tell Ceph they are lost so that I can get my cluster back to normal? I already reduced min_size from 9 to 8; with k=8 that's as low as it can usefully go. The OSDs in "down_osds_we_would_probe" have all already been marked as lost (ceph osd lost xx).
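
For context, what I've already done amounts to roughly the following (OSD IDs are placeholders):

# lowered min_size on the EC pool from 9 to 8
ceph osd pool set ec84-hdd-zm min_size 8

# marked each unrecoverable OSD as lost
ceph osd lost <id> --yes-i-really-mean-it

# the two PGs still show up as stuck afterwards
ceph pg dump_stuck inactive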

ceph health detail:
<snip>
PG_AVAILABILITY Reduced data availability: 2 pgs inactive, 2 pgs incomplete
    pg 18.c is incomplete, acting [32,48,58,40,13,44,61,59,30,27,43,37] (reducing pool ec84-hdd-zm min_size from 8 may help; search ceph.com/docs for 'incomplete')
    pg 18.1e is incomplete, acting [50,49,41,58,60,46,52,37,34,63,57,16] (reducing pool ec84-hdd-zm min_size from 8 may help; search ceph.com/docs for 'incomplete')
<snip>

root@pve4:~# ceph osd erasure-code-profile get ec-84-hdd
crush-device-class=
crush-failure-domain=host
crush-root=default
k=8
m=4
plugin=isa
technique=reed_sol_van

Results of "ceph pg 18.c query": https://pastebin.com/V8nByRF6
Results of "ceph pg 18.1e query": https://pastebin.com/rBWwPYUn
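
The closest things I've found to "telling Ceph the PGs are lost" are the two below, but I'm not sure either is appropriate or safe for incomplete EC PGs, which is essentially what I'm asking:

# recreate the PG empty, accepting that its data is gone
ceph osd force-create-pg 18.c

# or, with the acting primary stopped (OSD 32 is first in 18.c's acting
# set above), mark its shard complete so peering can finish
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-32 \
    --pgid 18.cs0 --op mark-complete

Does either of those make sense here, or is there a better option I'm missing?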

Thanks

Dan

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
