You'll probably have to recreate OSDs with the same ids (empty ones), let them boot, stop them, and mark them lost (a rough command sketch follows below the quoted message). There is a feature request in the tracker to improve this behavior: http://tracker.ceph.com/issues/10976
-Sam

On Mon, 2015-03-09 at 12:24 +0000, joel.merrick@xxxxxxxxx wrote:
> Hi,
>
> I'm trying to fix an issue within 0.93 on our internal cloud related
> to incomplete PGs (yes, I realise the folly of having the dev release
> - it's a not-so-test env now, so I need to recover this really). I'll
> detail the current outage info:
>
> 72 initial (now 65) OSDs
> 6 nodes
>
> * Update to 0.92 from Giant
> * Fine for a day
> * MDS outage overnight and subsequent node failure
> * Massive increase in RAM utilisation (10 GB per OSD!)
> * More failures
> * OSDs marked 'out' to try to alleviate the cluster's newly large
>   resource requirements; a couple died under the additional load
> * 'Superfluous and faulty' OSDs removed (ceph osd rm), auth keys deleted
> * RAM added to nodes (96 GB each, serving 10-12 OSDs)
> * Upgrade to 0.93
> * Fixed the journals broken by the 0.92 update
> * No more missing objects or degradation
>
> So, that brings me to today: I still have 73/2264 PGs listed as stuck
> incomplete/inactive. I also have requests that are blocked.
>
> Upon querying said placement groups, I notice that they are
> 'blocked_by' non-existent OSDs (ones I have removed due to issues).
> I have no way to tell them the OSD is lost, as it's already been
> removed from both the osdmap and the crushmap.
> Exporting the crushmap shows the non-existent OSDs as deviceN (e.g.
> device36 for the removed osd.36).
> Deleting those and reimporting the crushmap has no effect.
>
> Some further pg detail: https://gist.github.com/joelio/cecca9b48aca6d44451b
>
> So I'm stuck: I can't recover the PGs, because I can't remove a
> non-existent OSD that the PG thinks is blocking it.
>
> Help graciously accepted!
> Joel
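
Below is an untested sketch of the recreate-and-mark-lost sequence Sam describes at the top of this message, assuming a stock 0.93-era ceph CLI and sysvinit-style service scripts. The osd id, uuid, paths and caps are examples only; adapt them to your cluster and double-check each step before running it against live data.

    ID=36                         # one of the removed osd ids the PGs are blocked_by
    UUID=$(uuidgen)

    # Recreate an empty OSD entry so the osdmap knows about the id again.
    # If your version's 'osd create' doesn't accept an explicit id, plain
    # 'ceph osd create $UUID' hands out the lowest free id, which will be
    # the removed one as long as nothing lower is free.
    ceph osd create $UUID $ID

    # Give it an empty data dir plus a key, and register the key.
    mkdir -p /var/lib/ceph/osd/ceph-$ID
    ceph-osd -i $ID --mkfs --mkkey --osd-uuid $UUID
    ceph auth add osd.$ID osd 'allow *' mon 'allow profile osd' \
        -i /var/lib/ceph/osd/ceph-$ID/keyring

    # Let it boot, then stop it and mark it lost so the blocked PGs can
    # give up on it and continue peering.
    service ceph start osd.$ID    # or: /etc/init.d/ceph start osd.$ID
    service ceph stop osd.$ID
    ceph osd lost $ID --yes-i-really-mean-it

Repeat for each removed id that still shows up in the PGs' blocked_by lists.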
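
To see which PGs are still stuck and which ids they are waiting on, something along these lines should work (the pg id below is just a placeholder):

    # List the stuck PGs.
    ceph health detail
    ceph pg dump_stuck inactive

    # Inspect one of them; 'blocked_by' lists the osd ids it is waiting on.
    ceph pg 2.1f query | grep -A 5 blocked_by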
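
For reference, the crushmap round trip Joel describes (export, delete the stray deviceN lines, recompile, reimport) looks roughly like this. As he notes, it doesn't clear the blocked_by entries on its own, presumably because those come from the PGs' peering history rather than from the crushmap, which is why the mark-lost route above is needed:

    ceph osd getcrushmap -o crushmap.bin
    crushtool -d crushmap.bin -o crushmap.txt
    # edit crushmap.txt and remove the leftover deviceN lines, e.g. device36
    crushtool -c crushmap.txt -o crushmap.new
    ceph osd setcrushmap -i crushmap.new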