I’d like to better understand why the down OSD would cause the PG to get stuck after CRUSH was able to locate enough OSDs to map the PG. Is this some form of safety catch that prevents it from recovering, even though OSD.116 is no longer important for data integrity?

Marking the OSD lost is an option here, but it’s not really lost; it just takes some time to get a machine rebooted. I’m still working out my operational procedures for Ceph, and marking the OSD lost only to have it pop back up once the system reboots could be an issue that I’m not yet sure how to resolve. Can an OSD be marked as ‘found’ once it returns to the network?

-Chris

From: Goncalo Borges <goncalo.borges@xxxxxxxxxxxxx>

Hi Chris...

The precise osd set you see now [79,8,74] was obtained on epoch 104536, but only after many attempts, as shown by the recovery section.

Actually, in the first try (on epoch 100767), osd 116 was selected somehow (maybe it was up at the time?), and the pg probably got stuck because it went down during the recovery process.
The pg query also shows
You could check the documentation in [1] and see whether you can follow the suggestion shown in the pg query and mark osd 116 as lost. This should only be done after proper evaluation on your part.

Another thing I found strange is that the recovery section shows many attempts where you do not get a proper osd set. The very last recovery attempt was on epoch 104540.
From [2]: "When CRUSH fails to find enough OSDs to map to a PG, it will show as a 2147483647 which is ITEM_NONE or no OSD found."
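As a rough illustration of that sentinel value, here is a minimal Python sketch that flags the empty slots in an osd set. The helper names (unmapped_slots, is_complete) and the incomplete example set are hypothetical and not part of any Ceph API; only the value 2147483647 and the set [79,8,74] come from this thread.

```python
# 2147483647 is 0x7fffffff, the value the docs call ITEM_NONE ("no OSD found").
CRUSH_ITEM_NONE = 0x7FFFFFFF


def unmapped_slots(osd_set):
    """Return the positions in an up/acting set that hold no OSD."""
    return [i for i, osd in enumerate(osd_set) if osd == CRUSH_ITEM_NONE]


def is_complete(osd_set, size):
    """True when the set has `size` entries and each one is a real OSD id."""
    return len(osd_set) == size and not unmapped_slots(osd_set)


if __name__ == "__main__":
    # Hypothetical set with one slot CRUSH could not fill.
    incomplete = [79, 2147483647, 74]
    # The set reported in the thread for epoch 104536.
    complete = [79, 8, 74]

    print(unmapped_slots(incomplete))
    print(is_complete(complete, 3))
```

Running a check like this over each osd set listed in the recovery section would show which attempts could not fill all 3 slots.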
This could be an artifact of the peering being blocked by osd.116, or a genuine problem where you cannot get a proper osd set. That could happen for a variety of reasons: network issues, osds being almost full, or simply because the system can't get 3 osds in 3 different hosts.

Cheers
Goncalo

[2] http://docs.ceph.com/docs/master/rados/troubleshooting/troubleshooting-pg/

On 08/16/2016 11:42 AM, Heller, Chris wrote:
--
Goncalo Borges
Research Computing
ARC Centre of Excellence for Particle Physics at the Terascale
School of Physics A28 | University of Sydney, NSW 2006
T: +61 2 93511937
_______________________________________________ ceph-users mailing list ceph-users@xxxxxxxxxxxxxx http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com