Re: Unfound objects and inconsistent status reports

Wido den Hollander <wido@xxxxxxxx> · Wed, 11 May 2016 16:00:55 +0200 (CEST)

> On 11 May 2016 at 15:51, Simon Engelsman <simon@xxxxxxxxxxxx> wrote:
> 
> 
> Hello everyone,
> 
> Last week we experienced a strange scenario of unfound objects and
> inconsistent reports from the ceph tools. We solved it with help from
> Sage, and we wanted to share our experience and see whether it can be of
> any use to developers too.
> 

Looks like you went through some weird issues!

> After OSDs segfaulted randomly, our cluster ended up with one OSD down
> and unfound objects, probably due to a combination of inopportune
> crashes. We tried to start that OSD again, but it crashed when reading a
> specific PG. The log is here: http://pastebin.com/u9WFJnMR
> 
> Sage pointed out that it looked like some metadata was corrupted. The
> funny thing is that this PG didn't belong to that OSD anymore. Once we
> had made sure of that, we removed the PG from that OSD:
> 
> ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-51/ \
>     --pgid 2.1fd --op remove \
>     --journal-path /var/lib/ceph/osd/ceph-51/journal
> 
> We had to repeat this procedure for other PGs on that same OSD, as it
> kept crashing on startup. Finally the OSD was up and in, but the
> recovery process was stuck with 10 unfound objects. We deleted them by
> marking them as lost in their PGs with:
> 

Wasn't this specific OSD, number 51, also involved in the issues you had earlier?
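
Since the removal had to be repeated for several PGs on the same OSD, it may help to generate the commands instead of typing each one. Below is a minimal Python sketch that only prints the ceph-objectstore-tool invocations for review; it does not run them. The helper names build_remove_cmd and print_remove_plan are hypothetical, the paths assume the same /var/lib/ceph/osd/ceph-<id>/ layout as the command quoted above, and the only PG listed is the one named in this thread. The OSD must be stopped before the tool is used, and each PG should first be confirmed as no longer mapped to that OSD.

```python
import shlex


def build_remove_cmd(osd_id, pgid):
    """Return the argv list for removing one PG from a stopped OSD.

    Assumes the data path and journal layout used in the quoted command.
    """
    data_path = f"/var/lib/ceph/osd/ceph-{osd_id}/"
    return [
        "ceph-objectstore-tool",
        "--data-path", data_path,
        "--pgid", pgid,
        "--op", "remove",
        "--journal-path", data_path + "journal",
    ]


def print_remove_plan(osd_id, pgids):
    """Print one shell-quoted command per PG so it can be reviewed first."""
    for pgid in pgids:
        print(shlex.join(build_remove_cmd(osd_id, pgid)))


if __name__ == "__main__":
    # Only the PG mentioned in this thread; add others after verifying them.
    print_remove_plan(51, ["2.1fd"])
```

Printing the plan rather than executing it is deliberate: removing a PG from an OSD is destructive, so a human check of each line seems the safer default.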

> ceph pg 2.481 mark_unfound_lost delete
> 
> Right after that, recovery completed successfully, but the ceph reports
> were a bit inconsistent. ceph -s reported 7 unfound objects, while
> ceph health detail didn't report which PGs those unfound objects
> belonged to. Sage pointed us to ceph pg dump, which did show which
> PGs owned those objects (the crashed OSD was a member of all of them).
> However, when we listed missing objects on those PGs, they reported none:
> 
> {
>     "offset": {
>         "oid": "",
>         "key": "",
>         "snapid": 0,
>         "hash": 0,
>         "max": 0,
>         "pool": -9223372036854775808,
>         "namespace": ""
>     },
>     "num_missing": 0,
>     "num_unfound": 0,
>     "objects": [],
>     "more": 0
> }
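
To compare what each PG reports against the cluster-wide count, the JSON above can be summarized with a few lines of Python. This is only a sketch: it assumes the report has the same fields as the output quoted above (for example as printed by `ceph pg <pgid> list_missing`), and the function name summarize_missing is hypothetical. Reading "more" as "the listing was truncated" is also an assumption.

```python
import json


def summarize_missing(report_text):
    """Extract the counters from one PG's missing-objects report."""
    report = json.loads(report_text)
    return {
        "num_missing": report.get("num_missing", 0),
        "num_unfound": report.get("num_unfound", 0),
        "listed_objects": len(report.get("objects", [])),
        # Assumption: a non-zero "more" means further entries were not listed.
        "truncated": bool(report.get("more", 0)),
    }


# The report quoted in this thread, copied verbatim.
SAMPLE = """
{
    "offset": {
        "oid": "",
        "key": "",
        "snapid": 0,
        "hash": 0,
        "max": 0,
        "pool": -9223372036854775808,
        "namespace": ""
    },
    "num_missing": 0,
    "num_unfound": 0,
    "objects": [],
    "more": 0
}
"""

if __name__ == "__main__":
    print(summarize_missing(SAMPLE))
```

For the report above, every counter comes back as zero, which is exactly the mismatch with the 7 unfound objects shown by ceph -s.
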
> 
> Then we decided to restart the OSDs in those PGs, and the unfound
> objects disappeared from the ceph -s report.
> 
> It may be important to mention that we had four nodes running the OSDs:
> two nodes with v9.2.0 and another with v9.2.1. Our OSDs were apparently
> crashing because of an issue in v9.2.1. We shared this on the
> ceph-devel list, whose members were very helpful in solving it
> (http://comments.gmane.org/gmane.comp.file-systems.ceph.devel/31123).
> 
> Kind regards,
> 
> Simon Engelsman
> _______________________________________________
> ceph-users mailing list
> ceph-users@xxxxxxxxxxxxxx
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com