Hi,

I'm trying to fix an issue on 0.93 on our internal cloud related to incomplete PGs. (Yes, I realise the folly of running the dev release. It's a not-so-test env now, so I really need to recover this.)

Here is the current outage info:

72 OSDs initially (now 65), 6 nodes

* Update to 0.92 from Giant
* Fine for a day
* MDS outage overnight and subsequent node failure
* Massive increase in RAM utilisation (10G per OSD!)
* More failures
* OSDs marked 'out' to try to alleviate the new large cluster requirements; a couple died under the additional load
* 'Superfluous and faulty' OSDs rm'd, auth keys deleted
* RAM added to nodes (96GB each, serving 10-12 OSDs)
* Upgrade to 0.93
* Fix broken journals due to 0.92 update
* No more missing objects or degradation

So, that brings me to today. I still have 73/2264 PGs listed as stuck incomplete/inactive. I also have requests that are blocked.

Upon querying said placement groups, I notice that they are 'blocked_by' non-existent OSDs (ones I have removed due to issues). I have no way to tell them the OSD is lost, as it has already been removed from both the osdmap and the crushmap.

Exporting the crushmap shows the non-existent OSDs as deviceN (i.e. device36 for the removed osd.36). Deleting those entries and reimporting the crushmap has no effect.

Some further PG detail - https://gist.github.com/joelio/cecca9b48aca6d44451b

So I'm stuck: I can't recover the PGs, as I can't remove a non-existent OSD that the PG thinks is blocking it.

Help graciously accepted!

Joel

--
$ echo "kpfmAdpoofdufevq/dp/vl" | perl -pe 's/(.)/chr(ord($1)-1)/ge'

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
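
For anyone wanting to repeat the 'blocked_by' check described above, here is a minimal sketch of one way to do it. It is not taken from the cluster in question. It assumes the JSON output of `ceph pg <pgid> query` has already been loaded into a Python dict, that the blocking OSD ids appear as integer lists under keys named `blocked_by` somewhere in that JSON (the exact layout may differ between releases, so the code simply walks the whole structure), and that the ids of the OSDs still present are known, e.g. from `ceph osd ls`. The function names and the sample data are hypothetical.

```python
def find_blocked_by(node):
    """Recursively collect every integer listed under a 'blocked_by' key."""
    found = set()
    if isinstance(node, dict):
        for key, value in node.items():
            if key == "blocked_by" and isinstance(value, list):
                found.update(v for v in value if isinstance(v, int))
            else:
                found |= find_blocked_by(value)
    elif isinstance(node, list):
        for item in node:
            found |= find_blocked_by(item)
    return found


def missing_blockers(pg_query, existing_osds):
    """Return blocking OSD ids that are no longer among the existing OSDs."""
    return sorted(find_blocked_by(pg_query) - set(existing_osds))


if __name__ == "__main__":
    # Made-up fragment for illustration only; not real cluster output.
    sample_query = {
        "state": "incomplete",
        "info": {"stats": {"blocked_by": [36]}},
        "recovery_state": [{"name": "example", "blocked_by": [36, 12]}],
    }
    existing = [12, 13, 14]  # assumed ids of OSDs still in the osdmap
    print(missing_blockers(sample_query, existing))  # prints [36]
```

Running this over each stuck PG would give the list of removed OSDs that the PGs still claim to be waiting on.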