Hi folks,

I've got a serious issue with a Ceph cluster used for RBD. Four PGs are stuck in the incomplete state and I've been unable to repair them. Here's ceph status:

     health HEALTH_WARN
            4 pgs incomplete
            4 pgs stuck inactive
            4 pgs stuck unclean
            100 requests are blocked > 32 sec
     monmap e13: 3 mons at ...
            election epoch 2084, quorum 0,1,2 mon4,mon5,mon3
     osdmap e154083: 203 osds: 197 up, 197 in
      pgmap v37369382: 9856 pgs, 5 pools, 20932 GB data, 22321 kobjects
            64871 GB used, 653 TB / 716 TB avail
                9851 active+clean
                   4 incomplete
                   1 active+clean+scrubbing

All four PGs have the same primary OSD (osd.52), which is on a host that had its OSDs turned off because it was quite flaky:

    1.1bdb   incomplete   [52,100,130]   52   [52,100,130]   52
    1.5c2    incomplete   [52,191,109]   52   [52,191,109]   52
    1.f98    incomplete   [52,92,37]     52   [52,92,37]     52
    1.11dc   incomplete   [52,176,12]    52   [52,176,12]    52

One thing that strikes me as odd is that once osd.52 is taken out, these sets change completely.

The situation right now is that, for each of these PGs, the three OSDs hold similar but not identical amounts of data, with osd.52 holding the smallest amount (though not by much) in each case. Querying those PGs returns no response even after several minutes, and manually triggering scrubs or repairs on them does nothing. I've also lowered min_size from 2 to 1, but I'm not seeing any recovery activity.

Is there anything that can be done to recover without losing that data? Losing it would mean each VM has roughly a 75% chance of being destroyed.
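
For reference, these are roughly the commands I've been running. This is only a sketch of what was attempted; the pool name "rbd" below is a placeholder, since I only know the pool as id 1 from the PG ids:

    # Query one of the stuck PGs (hangs, no response after several minutes)
    ceph pg 1.1bdb query

    # Manually trigger a scrub / repair on an affected PG (no visible effect)
    ceph pg scrub 1.1bdb
    ceph pg repair 1.1bdb

    # Lower min_size from 2 to 1 on the pool (no recovery activity followed)
    ceph osd pool set rbd min_size 1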