Re: PG's incomplete after OSD failure

Just an update: it appears that no data actually exists for those PGs
on osd.117 and osd.111, but they are showing as incomplete anyway.

For the 8.ca PG, osd.111 has only an empty directory, but osd.190 is
filled with data.
For 8.6ae, osd.117 has no data in the PG directory and osd.190 is
filled with data, as before.
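
A check like the one described above can be scripted. This is only a
minimal sketch: it assumes a FileStore OSD whose PG directories live at
<osd data dir>/current/<pgid>_head, and the function name and the paths
in the usage note are illustrative, not taken from this cluster.

```python
import os


def pg_dir_file_count(osd_root, pgid):
    """Count regular files under <osd_root>/current/<pgid>_head.

    Returns None if the PG directory does not exist on this OSD.
    The directory layout is an assumption (FileStore-style OSD).
    """
    pg_dir = os.path.join(osd_root, "current", f"{pgid}_head")
    if not os.path.isdir(pg_dir):
        return None
    total = 0
    # Objects may sit in nested hash subdirectories, so walk the whole tree.
    for _dirpath, _dirnames, filenames in os.walk(pg_dir):
        total += len(filenames)
    return total
```

For example, pg_dir_file_count("/var/lib/ceph/osd/ceph-111", "8.ca")
returning 0 would match the "empty directory" case, assuming that is
where the OSD's data directory is mounted.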

Since all of the required data is on osd.190, would there be a way to
make osd.111 and osd.117 forget they have ever seen the two incomplete
PGs and therefore restart backfilling?


On Tue, Nov 11, 2014 at 10:37 AM, Matthew Anderson
<manderson8787@xxxxxxxxx> wrote:
> Hi All,
>
> We've had a string of very unfortunate failures and need a hand fixing
> the incomplete PGs that we're now left with. We're configured with 3
> replicas across different hosts, with 5 hosts in total.
>
> The timeline goes -
> -1 week  :: A full server goes offline with a failed backplane. It is
> still not working.
> -1 day  ::  OSD 190 fails.
> -1 day + 3 minutes :: OSD 121 fails in a different server, taking
> out several PGs and blocking IO.
> Today  :: The first failed OSD (osd.190) was cloned to a good drive
> with xfsdump | xfsrestore and now boots fine. The last failed OSD
> (osd.121) is completely unrecoverable and was marked as lost.
>
> What we're left with now is 2 incomplete PGs that are preventing RBD
> images from booting.
>
> # ceph pg dump_stuck inactive
> ok
> pg_stat    objects    mip    degr    misp    unf    bytes    log    disklog    state    state_stamp    v    reported    up    up_primary    acting    acting_primary    last_scrub    scrub_stamp    last_deep_scrub    deep_scrub_stamp
> 8.ca    2440    0    0    0    0    10219748864    9205    9205    incomplete    2014-11-11 10:29:04.910512    160435'959618    161358:6071679    [190,111]    190    [190,111]    190    86417'207324    2013-09-09 12:58:10.749001    86229'196887    2013-09-02 12:57:58.162789
> 8.6ae    0    0    0    0    0    0    3176    3176    incomplete    2014-11-11 10:24:07.000373    160931'1935986    161358:267    [117,190]    117    [117,190]    117    86424'389748    2013-09-09 16:52:58.796650    86424'389748    2013-09-09 16:52:58.796650
>
> We've tried a pg revert, but it reports 'no missing objects' and then
> does nothing. I've also done the usual scrub, deep-scrub, and pg and
> osd repairs. So far nothing has helped.
>
> I think it could be a similar situation to this post [
> http://www.spinics.net/lists/ceph-users/msg11461.html ] where one of
> the OSDs is holding a slightly newer but incomplete version of the PG
> which needs to be removed. Is anyone able to shed some light on how I
> might be able to use the objectstore tool to check if this is the
> case?
>
> If anyone has any suggestions, they would be greatly appreciated.
> Likewise, if you need any more information about my problem, just let
> me know.
>
> Thanks all
> -Matt
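
For anyone who wants to pull the state and acting set out of the
`ceph pg dump_stuck inactive` rows quoted above, here is a rough sketch.
It assumes the 21-column layout shown in that output (other Ceph
versions may differ), and parse_pg_row and the constants are
illustrative names, not part of any Ceph tool.

```python
import re

# Column names copied from the header of the quoted dump_stuck output.
COLUMNS = [
    "pg_stat", "objects", "mip", "degr", "misp", "unf", "bytes", "log",
    "disklog", "state", "state_stamp", "v", "reported", "up", "up_primary",
    "acting", "acting_primary", "last_scrub", "scrub_stamp",
    "last_deep_scrub", "deep_scrub_stamp",
]

# The two rows from the quoted output, unwrapped onto single lines.
ROW_8CA = (
    "8.ca 2440 0 0 0 0 10219748864 9205 9205 incomplete "
    "2014-11-11 10:29:04.910512 160435'959618 161358:6071679 "
    "[190,111] 190 [190,111] 190 86417'207324 "
    "2013-09-09 12:58:10.749001 86229'196887 2013-09-02 12:57:58.162789"
)
ROW_86AE = (
    "8.6ae 0 0 0 0 0 0 3176 3176 incomplete "
    "2014-11-11 10:24:07.000373 160931'1935986 161358:267 "
    "[117,190] 117 [117,190] 117 86424'389748 "
    "2013-09-09 16:52:58.796650 86424'389748 2013-09-09 16:52:58.796650"
)

_STAMP = re.compile(r"(\d{4}-\d{2}-\d{2})\s+(\d{2}:\d{2}:\d{2}\.\d+)")


def parse_pg_row(row):
    """Split one whitespace-separated PG row into a dict keyed by COLUMNS."""
    # Timestamps contain a space; join date and time so split() keeps them whole.
    fields = _STAMP.sub(r"\1T\2", row).split()
    if len(fields) != len(COLUMNS):
        raise ValueError(f"expected {len(COLUMNS)} fields, got {len(fields)}")
    rec = dict(zip(COLUMNS, fields))
    # Turn "[190,111]" into [190, 111] for the up and acting sets.
    for key in ("up", "acting"):
        rec[key] = [int(x) for x in rec[key].strip("[]").split(",") if x]
    return rec


if __name__ == "__main__":
    for row in (ROW_8CA, ROW_86AE):
        pg = parse_pg_row(row)
        print(pg["pg_stat"], pg["state"], pg["acting"], pg["acting_primary"])
```

Run against the two rows, this shows osd.190 in the acting set of both
incomplete PGs, as primary for 8.ca and as the second member for 8.6ae.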