On 05/11/2016 04:00 PM, Wido den Hollander wrote:
>
>> On 11 May 2016 at 15:51, Simon Engelsman <simon@xxxxxxxxxxxx> wrote:
>>
>>
>> Hello everyone,
>>
>> We experienced a strange scenario last week of unfound objects and
>> inconsistent reports from the ceph tools. We solved it with Sage's
>> help, and we wanted to share our experience in case it is of any use
>> for the developers too.
>>
>
> Looks like you went through some weird issues!

Yes, we had some fun with one of our ceph clusters :)

>
>> After OSDs segfaulted randomly, our cluster ended up with one OSD down
>> and unfound objects, probably due to a combination of inopportune
>> crashes. We tried to start that OSD again, but it crashed while
>> reading a specific PG from the log, see here:
>> http://pastebin.com/u9WFJnMR
>>
>> Sage pointed out that it looked like some metadata was corrupted. The
>> funny thing is that the PG didn't belong to that OSD anymore. Once we
>> made sure it didn't belong to that OSD, we removed the PG from it:
>>
>> ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-51/ --pgid
>> 2.1fd --op remove --journal-path /var/lib/ceph/osd/ceph-51/journal
>>
>> We had to repeat this procedure for other PGs on the same OSD, as it
>> kept crashing on startup. Finally the OSD was up and in, but the
>> recovery process was stuck with 10 unfound objects. We deleted them by
>> marking them as lost in their PGs:
>>
>
> Wasn't this specific OSD, number 51, also involved in the issues you
> had earlier?

No, actually it was a different cluster.

>
>> ceph pg 2.481 mark_unfound_lost delete
>>
>> Right after that, recovery completed successfully, but the ceph
>> reports were a bit inconsistent. ceph -s was reporting 7 unfound
>> objects, while ceph health detail didn't report which PGs those
>> unfound objects belonged to. Sage pointed us to ceph pg dump, which
>> indeed showed which PGs owned those objects (the crashed OSD was a
>> member of all of them). However, when we listed the missing objects on
>> those PGs, they reported none:
>>
>> {
>>     "offset": {
>>         "oid": "",
>>         "key": "",
>>         "snapid": 0,
>>         "hash": 0,
>>         "max": 0,
>>         "pool": -9223372036854775808,
>>         "namespace": ""
>>     },
>>     "num_missing": 0,
>>     "num_unfound": 0,
>>     "objects": [],
>>     "more": 0
>> }
>>
>> Then we decided to restart the OSDs hosting those PGs, and the unfound
>> objects disappeared from the ceph -s report.
>>
>> It may be important to mention that we had four nodes running the
>> OSDs, two nodes on v9.2.0 and another on v9.2.1. Our OSDs were
>> apparently crashing because of an issue in v9.2.1. We shared this on
>> the ceph-devel list, which was very helpful in solving it
>> (http://comments.gmane.org/gmane.comp.file-systems.ceph.devel/31123).
>>
>> Kind regards,
>>
>> Simon Engelsman
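
For anyone who runs into something similar: below is roughly the
sequence we used to double-check that a PG no longer mapped to the OSD
before removing its on-disk copy. This is only a sketch from our case
(PG 2.1fd, osd.51, default paths); adjust the ids and paths for your own
setup, and make sure the OSD daemon is stopped before pointing
ceph-objectstore-tool at its store.

    # Where does the cluster currently map this PG? osd.51 should not
    # appear in the up/acting sets if the PG no longer belongs to it.
    ceph pg map 2.1fd

    # Which PGs does the (stopped) OSD still hold on disk?
    ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-51/ \
        --journal-path /var/lib/ceph/osd/ceph-51/journal --op list-pgs

    # Remove the stale PG copy from the OSD's store (same command as
    # quoted above in the thread).
    ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-51/ \
        --journal-path /var/lib/ceph/osd/ceph-51/journal \
        --pgid 2.1fd --op remove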
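
And to track down which PGs were still claiming unfound objects while
ceph health detail stayed silent, something along these lines worked via
ceph pg dump. The JSON field names below are taken from our
Infernalis-era (v9.2.x) output and may be nested differently on newer
releases, so treat it as a sketch rather than a recipe:

    # PGs whose stats still report unfound objects.
    ceph pg dump --format json | jq -r \
        '.pg_stats[] | select(.stat_sum.num_objects_unfound > 0) | .pgid'

    # For each of those PGs, list the missing/unfound objects. In our
    # case this returned the empty listing quoted above, and restarting
    # the OSDs hosting those PGs cleared the stale counters from ceph -s.
    ceph pg 2.481 list_missing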