Hello,

Warning, this is a long story! There's a TL;DR close to the end.

We are replacing some of our spinning drives with SSDs. We have 14 OSD nodes with 12 drives each and are replacing 4 drives in each node with SSDs. The cluster is running Ceph Jewel (10.2.7), and the affected pool had min_size=2 and size=3.

After removing some of the drives (all from a single host) we noticed the rebalancing/recovery process got stuck and we had one PG with 2 unfound objects. Most of our OpenStack VMs were having issues: they were unresponsive or had other I/O problems.

We tried querying the PG but got no response even after hours of waiting. Trying to recover or delete the unfound objects did the same thing: absolutely nothing.

One of the two remaining OSD nodes that held the PG was experiencing huge load spikes correlated with disk I/O spikes: https://imgur.com/a/7g0eI

We removed that OSD, and after a while the other OSD started doing the same thing: huge load spikes. We tried querying the affected PG and deleting the unfound objects again; nothing changed. The OSDs this PG was supposed to be replicated to only had an empty folder.

We then removed the last OSD that still had the PG with the unfound objects, which left us with an incomplete PG. We recovered the data from the OSD we had removed before all this started and tried exporting and importing the PG using ceph-objectstore-tool, but unfortunately nothing happened. We also tried using ceph-objectstore-tool to find and delete the unfound objects on the last two OSDs we had removed and then re-import the PG, but that didn't work either.

*TL;DR* we had 2 unfound objects in a PG after removing an OSD; the cluster status was healthy before this and the pool has min_size=2 and size=3. We ended up having to delete the entire pool and recreate all the virtual machines.

If you have any idea why the PG was not being replicated to the other two OSDs, please let me know. Any suggestions on how to avoid this? I just want to make sure this never happens again.

Our story is similar to this one:
http://ceph-users.ceph.narkive.com/bWszhgi1/ceph-pg-incomplete-cluster-unusable#post19

--
Alex Cucu
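
P.S. For anyone who wants to retrace the steps, these are roughly the commands we used to inspect the PG and try to get rid of the unfound objects. The PG id 1.2f3 is just a placeholder and I'm writing these from memory, so treat them as a sketch rather than a transcript:

    # See which PG has unfound objects and which OSDs it maps to
    ceph health detail
    ceph pg dump_stuck unclean

    # Query the affected PG (this is the step that hung for hours for us)
    ceph pg 1.2f3 query

    # Give up on the unfound objects:
    #   revert - roll back to the previous version of each object
    #   delete - forget about the objects entirely
    ceph pg 1.2f3 mark_unfound_lost revert
    ceph pg 1.2f3 mark_unfound_lost delete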
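
The export/import attempt with ceph-objectstore-tool looked roughly like the block below. Again, the OSD ids (12 as the source, 34 as the target), the PG id and the file name are placeholders, and the exact flags may differ slightly between releases; the OSD daemon has to be stopped before the tool will touch its store:

    # Stop the source OSD; the tool needs exclusive access to the data dir
    systemctl stop ceph-osd@12

    # Export the PG from the old OSD's filestore
    ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-12 \
        --journal-path /var/lib/ceph/osd/ceph-12/journal \
        --pgid 1.2f3 --op export --file /tmp/pg.1.2f3.export

    # Remove whatever is left of the PG on the target OSD, then import it
    systemctl stop ceph-osd@34
    ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-34 \
        --journal-path /var/lib/ceph/osd/ceph-34/journal \
        --pgid 1.2f3 --op remove
    ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-34 \
        --journal-path /var/lib/ceph/osd/ceph-34/journal \
        --op import --file /tmp/pg.1.2f3.export
    systemctl start ceph-osd@34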
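
For the remaining drive swaps we are planning to drain one OSD at a time and only remove it once the cluster is back to HEALTH_OK, along these lines (the OSD id is a placeholder); if this is still not safe enough, or there is something better we should be doing, please say so:

    # Keep the cluster from marking other OSDs out during the maintenance
    ceph osd set noout

    # Mark the OSD out and let its data migrate off
    ceph osd out 12

    # Wait until all PGs are active+clean again
    ceph -s
    ceph pg dump_stuck unclean

    # Only then stop the daemon and remove the OSD from the cluster
    systemctl stop ceph-osd@12
    ceph osd crush remove osd.12
    ceph auth del osd.12
    ceph osd rm 12

    # When the whole batch is done
    ceph osd unset noout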