So I think I know what might have gone wrong.  When I took my osds out
of the cluster and shut them down, the first set of osds likely came
back up and in the cluster before the 300 seconds expired.  That would
have prevented the cluster from triggering recovery of the pgs from
the replica osds.

So the question is, can I force this to happen?  Can I take the
supposed primary osd down for 300+ seconds to allow the cluster to
start recovering the pgs (this will of course affect all the other pgs
on that osd), or is there a better way?  I've put a rough sketch of
the commands I'm considering at the bottom of this mail, below the
quoted message.

Note that all my secondary osds in these pgs have the expected amount
of data for the pg, remained up during the primary's downtime, and
should have the state needed to become the primary for the acting set.

Thanks for listening.

~jpr

On 03/25/2016 11:57 AM, John-Paul Robinson wrote:
> Hi Folks,
>
> One last dip into my old bobtail cluster.  (new hardware is on order)
>
> I have three pgs in an incomplete state.  The cluster was previously
> stable, but with a health warn state due to a few near-full osds.  I
> started resizing drives on one host to expand space after taking the
> osds that served them out and down.  My failure domain is two levels,
> osds and hosts, and I have two copies per placement group.
>
> I have three of my pgs flagged incomplete.
>
> root@d90-b1-1c-3a-c4-8f:~# date; sudo ceph --id nova health detail |
> grep incomplete
> Fri Mar 25 11:28:47 CDT 2016
> HEALTH_WARN 168 pgs backfill; 107 pgs backfilling; 241 pgs degraded; 3
> pgs incomplete; 3 pgs stuck inactive; 287 pgs stuck unclean; recovery
> 4913393/39589336 degraded (12.411%); recovering 120 o/s, 481MB/s; 4
> near full osd(s)
> pg 3.5 is stuck inactive since forever, current state incomplete, last
> acting [53,22]
> pg 3.150 is stuck inactive since forever, current state incomplete, last
> acting [50,74]
> pg 3.38c is stuck inactive since forever, current state incomplete, last
> acting [14,70]
> pg 3.5 is stuck unclean since forever, current state incomplete, last
> acting [53,22]
> pg 3.150 is stuck unclean since forever, current state incomplete, last
> acting [50,74]
> pg 3.38c is stuck unclean since forever, current state incomplete, last
> acting [14,70]
> pg 3.38c is incomplete, acting [14,70]
> pg 3.150 is incomplete, acting [50,74]
> pg 3.5 is incomplete, acting [53,22]
>
> Given that incomplete means:
>
> "Ceph detects that a placement group is missing information about writes
> that may have occurred, or does not have any healthy copies.  If you see
> this state, try to start any failed OSDs that may contain the needed
> information or temporarily adjust min_size to allow recovery."
>
> I have restarted all osds in these acting sets and they log normally,
> opening their respective journals and such.  However, the incomplete
> state remains.
>
> All three of the primary osds, 53, 50 and 14, were reformatted to
> expand their size, so I know there's no "spare" journal if it's
> referring to what was there before.  Btw, I did take all osds out and
> down before resizing their drives, so I'm not sure how these pgs would
> actually be expecting an old journal.
>
> I suspect I need to forgo the journal and let the secondaries become
> primary for these pgs.
>
> I sure hope that's possible.
>
> As always, thanks for any pointers.
>
> ~jpr
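
P.S.  Here's the rough sketch I mentioned above for forcing one of
these pgs to recover from its replica.  It's untested and assumes
bobtail behaves the way I think it does; osd.53 / pg 3.5 are just the
first acting set used as an example, and the init script and keyring
setup may differ on your systems.

  # on the osd host, stop the daemon so it can't rejoin on its own
  # (init naming varies; this is the sysvinit style on my boxes)
  sudo service ceph stop osd.53

  # mark it down and out right away instead of waiting out the
  # 300 second "mon osd down out interval"
  sudo ceph osd down 53
  sudo ceph osd out 53

  # then watch whether pg 3.5 re-peers with osd.22 as the primary
  sudo ceph pg 3.5 query
  sudo ceph health detail | grep incomplete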
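
And in case just outing the primary isn't enough, the "temporarily
adjust min_size" suggestion from the docs would look something like
this.  I haven't verified that bobtail supports setting min_size per
pool, so treat it as a guess; <poolname> is a placeholder for whichever
pool id 3 maps to.

  # find the name of pool id 3
  sudo ceph osd dump | grep '^pool 3 '

  # temporarily allow pgs to go active with a single copy
  sudo ceph osd pool set <poolname> min_size 1

  # ... once the pgs peer and recover, put it back ...
  sudo ceph osd pool set <poolname> min_size 2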