I think we are zeroing in now on the root cause for the stuck incomplete PGs. The common factor for all of our stuck PGs looks to be that they all show the removed OSD 8 in their "down_osds_we_would_probe" list (from "ceph pg <id> query").

For reference, I found a few archived threads from other people who hit similar problems in the past. The general consensus from those threads is that as long as down_osds_we_would_probe points to any OSD that can't be reached, those PGs will remain stuck incomplete and can't be cured by force_create_pg or even "ceph osd lost".

Question: is there any command we can run to remove the old OSD from down_osds_we_would_probe?

I did try creating a new "fake" OSD.8 today (just created the OSD, but didn't bring it all the way up), and I was finally able to run "ceph osd lost 8". It did not seem to have any impact.

If there is no command to remove the old OSD, I think our next step will be to bring up a new/real/empty OSD.8 and see if that clears the logjam. But it seems like there should be a tool to deal with this kind of thing?

Thanks,
-- Dan
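
P.S. For anyone who wants to check their own stuck PGs the same way, here is a rough Python sketch of the check described above. It assumes that "ceph pg <id> query" prints JSON with a "recovery_state" list whose peering entry carries "down_osds_we_would_probe"; the sample dict is hand-written for illustration and is not output from our cluster, and the function names are just placeholders of mine.

```python
import json
import subprocess


def down_osds_we_would_probe(pg_query):
    """Collect OSD ids listed under down_osds_we_would_probe in a parsed pg query."""
    osds = set()
    # Assumption: recovery_state is a list of dicts, one per peering state.
    for state in pg_query.get("recovery_state", []):
        osds.update(state.get("down_osds_we_would_probe", []))
    return sorted(osds)


def query_pg(pgid):
    """Run `ceph pg <pgid> query` and parse the JSON it prints.

    Needs a working ceph CLI and keyring on the host; not called below.
    """
    out = subprocess.check_output(["ceph", "pg", pgid, "query"])
    return json.loads(out)


# Hand-written sample in the assumed shape, NOT real cluster output.
sample = {
    "state": "incomplete",
    "recovery_state": [
        {"name": "Started/Primary/Peering", "down_osds_we_would_probe": [8]},
        {"name": "Started"},
    ],
}

print(down_osds_we_would_probe(sample))
```

On a live cluster you would pass the result of query_pg("<id>") instead of the sample, once for each stuck PG, and compare the lists to see whether the same removed OSD shows up in all of them.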
_______________________________________________ ceph-users mailing list ceph-users@xxxxxxxxxxxxxx http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com