Stuck PGs blocked_by non-existent OSDs

"joel.merrick@xxxxxxxxx" <joel.merrick@xxxxxxxxx> · Mon, 9 Mar 2015 12:24:46 +0000

Hi,

I'm trying to fix an issue with incomplete PGs on 0.93 in our internal
cloud (yes, I realise the folly of running the dev release - it's a
not-so-test env now, so I really need to recover this). I'll detail the
current outage info:

72 initial (now 65) OSDs
6 nodes

* Update to 0.92 from Giant.
* Fine for a day
* MDS outage overnight and subsequent node failure
* Massive increase in RAM utilisation (10G per OSD!)
* More failure
* OSDs marked 'out' to try to alleviate the new, large cluster
requirements, and a couple died under the additional load
* 'Superfluous and faulty' OSDs rm'd, auth keys deleted
* RAM added to nodes (96GB each - serving 10-12 OSDs)
* Upgrade to 0.93
* Fix broken journals due to 0.92 update
* No more missing objects or degradation

So that brings me to today: I still have 73/2264 PGs listed as stuck
incomplete/inactive. I also have blocked requests.

Upon querying those placement groups, I notice that they are
'blocked_by' non-existent OSDs (ones I have removed due to issues).
I have no way to mark those OSDs as lost (as they've already been
removed, both from the osdmap and the crushmap).
Exporting the crushmap shows the non-existent OSDs as deviceN (i.e.
device36 for the removed osd.36).
Deleting those and reimporting the crushmap has no effect.
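
For anyone wanting to check their own map for the same leftovers, here is a
minimal sketch. It assumes the crushmap has already been exported and
decompiled to text (e.g. with `ceph osd getcrushmap -o` and `crushtool -d`)
and that removed OSDs show up as `device N deviceN` lines, as described
above. The function name and the sample text are made up for illustration.

```python
import re

# Matches device lines in a decompiled CRUSH map, e.g. "device 36 device36".
DEVICE_LINE = re.compile(r"^device\s+(\d+)\s+(\S+)")


def placeholder_devices(crush_text):
    """Return ids of device slots named 'deviceN' instead of 'osd.N'."""
    ids = []
    for line in crush_text.splitlines():
        match = DEVICE_LINE.match(line.strip())
        if match and match.group(2) == "device" + match.group(1):
            ids.append(int(match.group(1)))
    return ids


# Hypothetical excerpt of a decompiled map, not taken from the real cluster.
sample = """\
# devices
device 35 osd.35
device 36 device36
device 37 osd.37
"""
print(placeholder_devices(sample))
```

This only lists the placeholder slots; it does not change the map.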

Some further pg detail - https://gist.github.com/joelio/cecca9b48aca6d44451b

So I'm stuck: I can't recover the PGs, as I can't remove a
non-existent OSD that the PG thinks is blocking it.
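
To make the mismatch concrete, the check amounts to comparing the OSD ids a
PG reports as blockers against the ids that still exist. The sketch below
assumes the JSON from `ceph pg <pgid> query` contains one or more
`blocked_by` lists of integer OSD ids somewhere in its structure (the exact
location is an assumption, so it searches recursively) and that the
surviving ids come from something like `ceph osd ls`. The function names
and the sample JSON are hypothetical.

```python
import json


def find_blocked_by(node):
    """Recursively collect integer OSD ids found under any 'blocked_by' key."""
    found = set()
    if isinstance(node, dict):
        for key, value in node.items():
            if key == "blocked_by" and isinstance(value, list):
                # Keep plain integers only (bool is a subclass of int).
                found.update(
                    v for v in value
                    if isinstance(v, int) and not isinstance(v, bool)
                )
            else:
                found |= find_blocked_by(value)
    elif isinstance(node, list):
        for item in node:
            found |= find_blocked_by(item)
    return found


def phantom_blockers(pg_query_json, existing_osd_ids):
    """Return blocking OSD ids that are absent from the set of existing OSDs."""
    blocked = find_blocked_by(json.loads(pg_query_json))
    return sorted(blocked - set(existing_osd_ids))


# Made-up query output: osd.36 was removed, osd.41 still exists.
sample = '{"state": "incomplete", "info": {"stats": {"blocked_by": [36, 41]}}}'
print(phantom_blockers(sample, existing_osd_ids=[0, 1, 41]))
```

Any id this returns is one the PG is waiting on but the cluster no longer
knows about, which is the situation described above.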

Help graciously accepted!
Joel

-- 
$ echo "kpfmAdpoofdufevq/dp/vl" | perl -pe 's/(.)/chr(ord($1)-1)/ge'