> On 11 May 2016 at 15:51, Simon Engelsman <simon@xxxxxxxxxxxx> wrote:
>
> Hello everyone,
>
> We experienced a strange scenario last week of unfound objects and
> inconsistent reports from the Ceph tools. We solved it with help from
> Sage, and we want to share our experience in case it is of any use for
> developers too.

Looks like you went through some weird issues!

> After OSDs segfaulted randomly, our cluster ended up with one OSD down
> and unfound objects, probably due to a combination of inopportune
> crashes. We tried to start that OSD again, but it crashed when reading
> a specific PG from the log; see: http://pastebin.com/u9WFJnMR
>
> Sage pointed out that it looked like some metadata was corrupted. The
> funny thing is that that PG didn't belong to that OSD anymore. Once we
> had made sure it didn't belong to that OSD, we removed the PG from it:
>
> ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-51/ --pgid
> 2.1fd --op remove --journal-path /var/lib/ceph/osd/ceph-51/journal
>
> We had to repeat this procedure for other PGs on that same OSD, as it
> kept crashing on startup. Finally the OSD was up and in, but the
> recovery process was stuck with 10 unfound objects. We deleted them,
> marking them as lost in their PGs with:

Wasn't this specific OSD, number 51, also involved in the issues you had
earlier?

> ceph pg 2.481 mark_unfound_lost delete
>
> Right after that, recovery completed successfully, but the ceph reports
> were a bit inconsistent: ceph -s was reporting 7 unfound objects, while
> ceph health detail didn't report which PGs those unfound objects
> belonged to. Sage pointed us to ceph pg dump, which indeed showed which
> PGs owned those objects (the crashed OSD was a member of all of them).
> However, when we listed the missing objects on those PGs, they reported
> none:
>
> {
>     "offset": {
>         "oid": "",
>         "key": "",
>         "snapid": 0,
>         "hash": 0,
>         "max": 0,
>         "pool": -9223372036854775808,
>         "namespace": ""
>     },
>     "num_missing": 0,
>     "num_unfound": 0,
>     "objects": [],
>     "more": 0
> }
>
> Then we decided to restart the OSDs of those PGs, and the unfound
> objects disappeared from the ceph -s report.
>
> It may be important to mention that we had four nodes running the OSDs:
> two nodes with v9.2.0 and another with v9.2.1. Our OSDs were apparently
> crashing because of an issue in v9.2.1. We shared this on the
> ceph-devel list, which was very helpful in solving it
> (http://comments.gmane.org/gmane.comp.file-systems.ceph.devel/31123).
>
> Kind regards,
>
> Simon Engelsman
> _______________________________________________
> ceph-users mailing list
> ceph-users@xxxxxxxxxxxxxx
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
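For anyone following along, the "make sure the PG no longer belongs to
that OSD" step can be double-checked from the command line before
anything is removed. A minimal sketch, reusing the PG ID 2.1fd and OSD 51
from the thread, and assuming the OSD daemon is stopped while
ceph-objectstore-tool runs:

    # Where does CRUSH currently map this PG? OSD 51 should not appear
    # in the up/acting sets if the PG no longer belongs to it.
    ceph pg map 2.1fd

    # With the daemon stopped, list the PGs actually present on disk.
    ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-51/ \
        --journal-path /var/lib/ceph/osd/ceph-51/journal --op list-pgs

    # Only once both checks agree, remove the stray copy, as done above.
    ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-51/ \
        --journal-path /var/lib/ceph/osd/ceph-51/journal \
        --pgid 2.1fd --op remove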