On 05/11/2016 04:00 PM, Wido den Hollander wrote:
>
>> On 11 May 2016 at 15:51, Simon Engelsman <simon@xxxxxxxxxxxx> wrote:
>>
>>
>> Hello everyone,
>>
>> We experienced a strange scenario last week of unfound objects and
>> inconsistent reports from the ceph tools. We solved it with Sage's
>> help, and we wanted to share our experience in case it is of any use
>> for the developers too.
>>
>
> Looks like you went through some weird issues!

Yes, we had some fun with one of our ceph clusters :)

>
>> After OSDs segfaulted randomly, our cluster ended up with one OSD down
>> and unfound objects, probably due to a combination of inopportune
>> crashes. We tried to start that OSD again, but it crashed while
>> reading a specific PG from the log, see here:
>> http://pastebin.com/u9WFJnMR
>>
>> Sage pointed out that it looked like some metadata was corrupted. The
>> funny thing is that the PG didn't belong to that OSD anymore. Once we
>> made sure it didn't belong to that OSD, we removed the PG from it:
>>
>> ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-51/ --pgid
>> 2.1fd --op remove --journal-path /var/lib/ceph/osd/ceph-51/journal
>>
>> We had to repeat this procedure for other PGs on the same OSD, as it
>> kept crashing on startup. Finally the OSD was up and in, but the
>> recovery process was stuck with 10 unfound objects. We deleted them by
>> marking them as lost in their PGs:
>>
>
> Wasn't this specific OSD, number 51, also involved in the issues you
> had earlier?

No, actually it was a different cluster.

>
>> ceph pg 2.481 mark_unfound_lost delete
>>
>> Right after that, recovery completed successfully, but the ceph
>> reports were a bit inconsistent. ceph -s was reporting 7 unfound
>> objects, while ceph health detail didn't report which PGs those
>> unfound objects belonged to. Sage pointed us to ceph pg dump, which
>> indeed showed which PGs owned those objects (the crashed OSD was a
>> member of all of them). However, when we listed the missing objects on
>> those PGs, they reported none:
>>
>> {
>>     "offset": {
>>         "oid": "",
>>         "key": "",
>>         "snapid": 0,
>>         "hash": 0,
>>         "max": 0,
>>         "pool": -9223372036854775808,
>>         "namespace": ""
>>     },
>>     "num_missing": 0,
>>     "num_unfound": 0,
>>     "objects": [],
>>     "more": 0
>> }
>>
>> Then we decided to restart the OSDs hosting those PGs, and the unfound
>> objects disappeared from the ceph -s report.
>>
>> It may be important to mention that we had four nodes running the
>> OSDs, two nodes on v9.2.0 and another on v9.2.1. Our OSDs were
>> apparently crashing because of an issue in v9.2.1. We shared this on
>> the ceph-devel list, which was very helpful in solving it
>> (http://comments.gmane.org/gmane.comp.file-systems.ceph.devel/31123).
>>
>> Kind regards,
>>
>> Simon Engelsman
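
For anyone who runs into something similar: below is roughly the
sequence we used to double-check that a PG no longer mapped to the OSD
before removing its on-disk copy. This is only a sketch from our case
(PG 2.1fd, osd.51, default paths); adjust the ids and paths for your own
setup, and make sure the OSD daemon is stopped before pointing
ceph-objectstore-tool at its store.

    # Where does the cluster currently map this PG? osd.51 should not
    # appear in the up/acting sets if the PG no longer belongs to it.
    ceph pg map 2.1fd

    # Which PGs does the (stopped) OSD still hold on disk?
    ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-51/ \
        --journal-path /var/lib/ceph/osd/ceph-51/journal --op list-pgs

    # Remove the stale PG copy from the OSD's store (same command as
    # quoted above in the thread).
    ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-51/ \
        --journal-path /var/lib/ceph/osd/ceph-51/journal \
        --pgid 2.1fd --op remove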
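
And to track down which PGs were still claiming unfound objects while
ceph health detail stayed silent, something along these lines worked via
ceph pg dump. The JSON field names below are taken from our
Infernalis-era (v9.2.x) output and may be nested differently on newer
releases, so treat it as a sketch rather than a recipe:

    # PGs whose stats still report unfound objects.
    ceph pg dump --format json | jq -r \
        '.pg_stats[] | select(.stat_sum.num_objects_unfound > 0) | .pgid'

    # For each of those PGs, list the missing/unfound objects. In our
    # case this returned the empty listing quoted above, and restarting
    # the OSDs hosting those PGs cleared the stale counters from ceph -s.
    ceph pg 2.481 list_missing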