I've done a bit more work tonight and managed to get some more data back. Osd.121, which was previously completely dead, has made it through an XFS repair with a more fault-tolerant HBA firmware, and I was able to export both of the placement groups required using ceph_objectstore_tool. The osd would probably boot if I hadn't already marked it as lost :(

I've basically got it down to two options.

The first is to import the exported data from osd.121 into osd.190, which would complete the PG, but this fails with a filestore feature mismatch because the sharded objects feature is missing on the target osd:

Export has incompatible features set compat={},rocompat={},incompat={1=initial feature set(~v.18),2=pginfo object,3=object locator,4=last_epoch_clean,5=categories,6=hobjectpool,7=biginfo,8=leveldbinfo,9=leveldblog,10=snapmapper,11=sharded objects,12=transaction hints}

The second would be to run ceph pg force_create_pg on each of the problem PGs to reset them back to empty and then import the data using ceph_objectstore_tool import-rados. Unfortunately this has failed as well: when I tested ceph pg force_create_pg on an incomplete PG in another pool, the PG gets set to creating but then goes back to incomplete after a few minutes.

I've trawled the mailing list for solutions but have come up empty; neither problem appears to have been resolved before.
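To make the two options concrete, this is roughly what I'm attempting. The OSD paths and export file name below are from my setup, the pool name is assumed, and the import-rados syntax is from memory, so treat it as a sketch rather than exact commands:

    # Option 1: with osd.190 stopped, import the PG exported from osd.121
    ceph_objectstore_tool --data-path /var/lib/ceph/osd/ceph-190 \
        --journal-path /var/lib/ceph/osd/ceph-190/journal \
        --op import --file 8.6ae.export

    # Option 2: recreate the PG as empty, then push the exported objects
    # back in through librados
    ceph pg force_create_pg 8.6ae
    ceph_objectstore_tool import-rados rbd 8.6ae.export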
On Tue, Nov 11, 2014 at 5:54 PM, Matthew Anderson <manderson8787@xxxxxxxxx> wrote:
> Thanks for your reply Sage!
>
> I've tested with 8.6ae and no luck I'm afraid. Steps taken were -
> Stop osd.117
> Export 8.6ae from osd.117
> Remove 8.6ae from osd.117
> Start osd.117
> Restart osd.190 (afterwards still showing incomplete)
>
> After this the PG was still showing incomplete and ceph pg dump_stuck
> inactive shows -
> pg_stat objects mip degr misp unf bytes log disklog state state_stamp
> v reported up up_primary acting acting_primary last_scrub scrub_stamp
> last_deep_scrub deep_scrub_stamp
> 8.6ae 0 0 0 0 0 0 0 0 incomplete 2014-11-11 17:34:27.168078 0'0
> 161425:40 [117,190] 117 [117,190] 117 86424'389748 2013-09-09
> 16:52:58.796650 86424'389748 2013-09-09 16:52:58.796650
>
> I then tried an export from OSD 190 to OSD 117 by doing -
> Stop osd.190 and osd.117
> Export pg 8.6ae from osd.190
> Import the file generated in the previous step into osd.117
> Boot both osd.190 and osd.117
>
> When osd.117 attempts to start it generates a failed assert; the full log
> is here http://pastebin.com/S4CXrTAL
> -1> 2014-11-11 17:25:15.130509 7f9f44512900 0 osd.117 161404 load_pgs
> 0> 2014-11-11 17:25:18.604696 7f9f44512900 -1 osd/OSD.h: In
> function 'OSDMapRef OSDService::get_map(epoch_t)' thread 7f9f44512900
> time 2014-11-11 17:25:18.602626
> osd/OSD.h: 715: FAILED assert(ret)
>
> ceph version 0.87 (c51c8f9d80fa4e0168aa52685b8de40e42758578)
> 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char
> const*)+0x8b) [0xb8231b]
> 2: (OSDService::get_map(unsigned int)+0x3f) [0x6eea2f]
> 3: (OSD::load_pgs()+0x1b78) [0x6aae18]
> 4: (OSD::init()+0x71f) [0x6abf5f]
> 5: (main()+0x252c) [0x638cfc]
> 6: (__libc_start_main()+0xf5) [0x7f9f41650ec5]
> 7: /usr/bin/ceph-osd() [0x651027]
>
> I also attempted the same steps with 8.ca and got the same results.
> The below is the current state of the pg with it removed from osd.111 -
> pg_stat objects mip degr misp unf bytes log disklog state state_stamp
> v reported up up_primary acting acting_primary last_scrub scrub_stamp
> last_deep_scrub deep_scrub_stamp
> 8.ca 2440 0 0 0 0 10219748864 9205 9205 incomplete 2014-11-11
> 17:39:28.570675 160435'959618 161425:6071759 [190,111] 190 [190,111]
> 190 86417'207324 2013-09-09 12:58:10.749001 86229'196887 2013-09-02
> 12:57:58.162789
>
> Any idea of where I can go from here?
> One thought I had was setting osd.111 and osd.117 out of the cluster;
> once the data has moved I can shut them down and mark them as lost,
> which would make osd.190 the only replica available for those PGs.
>
> Thanks again
>
> On Tue, Nov 11, 2014 at 1:10 PM, Sage Weil <sweil@xxxxxxxxxx> wrote:
>> On Tue, 11 Nov 2014, Matthew Anderson wrote:
>>> Just an update, it appears that no data actually exists for those PGs
>>> on osd.117 and osd.111 but it's showing as incomplete anyway.
>>>
>>> So for the 8.ca PG, osd.111 has only an empty directory but osd.190 is
>>> filled with data.
>>> For 8.6ae, osd.117 has no data in the pg directory and osd.190 is
>>> filled with data as before.
>>>
>>> Since all of the required data is on osd.190, would there be a way to
>>> make osd.111 and osd.117 forget they have ever seen the two incomplete
>>> PGs and therefore restart backfilling?
>>
>> Ah, that's good news. You should know that the copy on osd.190 is
>> slightly out of date, but it is much better than losing the entire
>> contents of the PG. More specifically, for 8.6ae the latest version was
>> 1935986 but the copy on osd.190 is at 1935747, about 200 writes in the
>> past. You'll need to fsck the RBD images after this is all done.
>>
>> I don't think we've tested this recovery scenario, but I think you'll be
>> able to recover with ceph_objectstore_tool, which has an import/export
>> function and a delete function. First, try removing the newer version of
>> the pg on osd.117. First export it for good measure (even tho it's
>> empty):
>>
>> stop the osd
>>
>> ceph_objectstore_tool --data-path /var/lib/ceph/osd/ceph-117 \
>>     --journal-path /var/lib/ceph/osd/ceph-117/journal \
>>     --op export --pgid 8.6ae --file osd.117.8.7ae
>>
>> ceph_objectstore_tool --data-path /var/lib/ceph/osd/ceph-117 \
>>     --journal-path /var/lib/ceph/osd/ceph-117/journal \
>>     --op remove --pgid 8.6ae
>>
>> and restart. If that doesn't peer, you can also try exporting the pg from
>> osd.190 and importing it into osd.117. I think just removing the
>> newer empty pg on osd.117 will do the trick, though...
>>
>> sage
>>
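For completeness, the export-from-osd.190 / import-into-osd.117 path Sage mentions here is what I described trying above (and is what triggered the load_pgs assert). With both OSDs stopped, it was roughly the following; the paths and export file name are just what I used:

    ceph_objectstore_tool --data-path /var/lib/ceph/osd/ceph-190 \
        --journal-path /var/lib/ceph/osd/ceph-190/journal \
        --op export --pgid 8.6ae --file 8.6ae.export

    ceph_objectstore_tool --data-path /var/lib/ceph/osd/ceph-117 \
        --journal-path /var/lib/ceph/osd/ceph-117/journal \
        --op import --file 8.6ae.export

Both OSDs were then started again.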
>>>
>>> On Tue, Nov 11, 2014 at 10:37 AM, Matthew Anderson
>>> <manderson8787@xxxxxxxxx> wrote:
>>> > Hi All,
>>> >
>>> > We've had a string of very unfortunate failures and need a hand fixing
>>> > the incomplete PGs that we're now left with. We're configured with 3
>>> > replicas over different hosts, with 5 hosts in total.
>>> >
>>> > The timeline goes -
>>> > -1 week :: A full server goes offline with a failed backplane. Still
>>> > not working
>>> > -1 day :: OSD 190 fails
>>> > -1 day + 3 minutes :: OSD 121, in a different server, fails, taking
>>> > out several PGs and blocking IO
>>> > Today :: The first failed osd (osd.190) was cloned to a good drive
>>> > with xfs_dump | xfs_restore and now boots fine. The last failed osd
>>> > (osd.121) is completely unrecoverable and was marked as lost.
>>> >
>>> > What we're left with now is 2 incomplete PGs that are preventing RBD
>>> > images from booting.
>>> >
>>> > # ceph pg dump_stuck inactive
>>> > ok
>>> > pg_stat objects mip degr misp unf bytes log
>>> > disklog state state_stamp v reported up up_primary
>>> > acting acting_primary last_scrub scrub_stamp
>>> > last_deep_scrub deep_scrub_stamp
>>> > 8.ca 2440 0 0 0 0 10219748864 9205 9205
>>> > incomplete 2014-11-11 10:29:04.910512 160435'959618
>>> > 161358:6071679 [190,111] 190 [190,111] 190 86417'207324
>>> > 2013-09-09 12:58:10.749001 86229'196887 2013-09-02
>>> > 12:57:58.162789
>>> > 8.6ae 0 0 0 0 0 0 3176 3176 incomplete
>>> > 2014-11-11 10:24:07.000373 160931'1935986 161358:267
>>> > [117,190] 117 [117,190] 117 86424'389748 2013-09-09
>>> > 16:52:58.796650 86424'389748 2013-09-09 16:52:58.796650
>>> >
>>> > We've tried doing a pg revert but it says 'no missing objects'
>>> > and then does nothing. I've also done the usual scrub,
>>> > deep-scrub, pg and osd repairs... so far nothing has helped.
>>> >
>>> > I think it could be a similar situation to this post [
>>> > http://www.spinics.net/lists/ceph-users/msg11461.html ] where one of
>>> > the osds is holding a slightly newer but incomplete version of the PG
>>> > which needs to be removed. Is anyone able to shed some light on how I
>>> > might be able to use the objectstore tool to check if this is the
>>> > case?
>>> >
>>> > If anyone has any suggestions it would be greatly appreciated.
>>> > Likewise, if you need any more information about my problem just let me
>>> > know.
>>> >
>>> > Thanks all
>>> > -Matt
>>> _______________________________________________
>>> ceph-users mailing list
>>> ceph-users@xxxxxxxxxxxxxx
>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>>
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com