Hi,

The disks died, and were removed by:

ceph osd out $osd
ceph osd lost $osd
ceph osd crush remove $osd
ceph auth del $osd
ceph osd rm $osd

When I wrote my earlier mails it was after the 'lost' or 'crush remove'
step, I'm not sure which. But even the last step didn't fix the issue. It
looked like this: http://pastebin.com/UjSjVsJ0

Matyas
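[Editor's note: for reference, a minimal shell sketch of the removal
sequence above. The loop, the osd_ids placeholder (set here to the two down
OSDs mentioned later in the thread), and the --yes-i-really-mean-it flag on
'ceph osd lost' are illustrative additions, not taken from the original
mail; depending on the release, some subcommands expect the numeric id and
others the "osd.<id>" name.]

    #!/bin/sh
    # Remove dead OSDs from the cluster, following the steps listed above.
    # osd_ids is a placeholder; 1 and 22 are the two down OSDs from this thread.
    osd_ids="1 22"

    for osd in $osd_ids; do
        ceph osd out "$osd"                          # stop mapping data to it
        ceph osd lost "$osd" --yes-i-really-mean-it  # give up on its PG copies
        ceph osd crush remove "osd.$osd"             # drop it from the CRUSH map
        ceph auth del "osd.$osd"                     # remove its cephx key
        ceph osd rm "$osd"                           # delete the OSD entry itself
    done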
On Tue, 5 Jul 2016, Sean Redmond wrote:

> Hi,
>
> What happened to the missing 2 OSDs?
>
> 53 osds: 51 up, 51 in
>
> Thanks
>
> On Tue, Jul 5, 2016 at 4:04 PM, Matyas Koszik <koszik@xxxxxx> wrote:
>
> > Should you be interested, the solution to this was
> >
> > ceph pg $pg mark_unfound_lost delete
> >
> > for all pgs that had unfound objects; now the cluster is back in a
> > healthy state.
> >
> > I think this is very counter-intuitive (why should totally unrelated
> > pgs be affected by this?!), but at least the solution was simple.
> >
> > Matyas
> >
> > On Mon, 4 Jul 2016, Oliver Dzombic wrote:
> >
> > > Hi,
> > >
> > > Did you already do something (replacing drives or changing something)?
> > >
> > > You have 11 scrub errors, and ~11 inconsistent pg's.
> > >
> > > The inconsistent pg's, for example:
> > >
> > > pg 4.3a7 is stuck unclean for 629.766502, current state
> > > active+recovery_wait+degraded+inconsistent, last acting [10,21]
> > >
> > > are not on the down OSDs 1 and 22, neither of them. So they should
> > > not be missing. But they are.
> > >
> > > Anyway, I think the next step would be to run a pg repair command
> > > and see where the road goes.
> > >
> > > --
> > > Mit freundlichen Gruessen / Best regards
> > >
> > > Oliver Dzombic
> > > IP-Interactive
> > >
> > > mailto:info@xxxxxxxxxxxxxxxxx
> > >
> > > Anschrift:
> > >
> > > IP Interactive UG (haftungsbeschraenkt)
> > > Zum Sonnenberg 1-3
> > > 63571 Gelnhausen
> > >
> > > HRB 93402 beim Amtsgericht Hanau
> > > Geschäftsführung: Oliver Dzombic
> > >
> > > Steuer Nr.: 35 236 3622 1
> > > UST ID: DE274086107
> > >
> > > On 03.07.2016 at 23:59, Matyas Koszik wrote:
> > > >
> > > > Hi,
> > > >
> > > > I've continued restarting osds in the meantime, and it got somewhat
> > > > better, but it's still very far from optimal.
> > > >
> > > > Here are the details you requested:
> > > >
> > > > http://pastebin.com/Vqgadz24
> > > >
> > > > http://pastebin.com/vCL6BRvC
> > > >
> > > > Matyas
> > > >
> > > > On Sun, 3 Jul 2016, Oliver Dzombic wrote:
> > > >
> > > >> Hi,
> > > >>
> > > >> please provide:
> > > >>
> > > >> ceph health detail
> > > >>
> > > >> ceph osd tree
> > > >>
> > > >> --
> > > >> Mit freundlichen Gruessen / Best regards
> > > >>
> > > >> Oliver Dzombic
> > > >> IP-Interactive
> > > >>
> > > >> mailto:info@xxxxxxxxxxxxxxxxx
> > > >>
> > > >> Anschrift:
> > > >>
> > > >> IP Interactive UG (haftungsbeschraenkt)
> > > >> Zum Sonnenberg 1-3
> > > >> 63571 Gelnhausen
> > > >>
> > > >> HRB 93402 beim Amtsgericht Hanau
> > > >> Geschäftsführung: Oliver Dzombic
> > > >>
> > > >> Steuer Nr.: 35 236 3622 1
> > > >> UST ID: DE274086107
> > > >>
> > > >> On 03.07.2016 at 21:36, Matyas Koszik wrote:
> > > >>>
> > > >>> Hi,
> > > >>>
> > > >>> I recently upgraded to jewel (10.2.2) and now I'm confronted with a
> > > >>> rather strange behavior: recovery does not progress in the way it
> > > >>> should. If I restart the osds on a host, it'll get a bit better (or
> > > >>> worse), like this:
> > > >>>
> > > >>> 50 pgs undersized
> > > >>> recovery 43775/7057285 objects degraded (0.620%)
> > > >>> recovery 87980/7057285 objects misplaced (1.247%)
> > > >>>
> > > >>> [restart osds on node1]
> > > >>>
> > > >>> 44 pgs undersized
> > > >>> recovery 39623/7061519 objects degraded (0.561%)
> > > >>> recovery 92142/7061519 objects misplaced (1.305%)
> > > >>>
> > > >>> [restart osds on node1]
> > > >>>
> > > >>> 43 pgs undersized
> > > >>> 1116 requests are blocked > 32 sec
> > > >>> recovery 38181/7061529 objects degraded (0.541%)
> > > >>> recovery 90617/7061529 objects misplaced (1.283%)
> > > >>>
> > > >>> ...
> > > >>>
> > > >>> The current state is this:
> > > >>>
> > > >>> osdmap e38804: 53 osds: 51 up, 51 in; 66 remapped pgs
> > > >>> pgmap v14797137: 4388 pgs, 8 pools, 13626 GB data, 3434 kobjects
> > > >>> 27474 GB used, 22856 GB / 50330 GB avail
> > > >>> 38172/7061565 objects degraded (0.541%)
> > > >>> 90617/7061565 objects misplaced (1.283%)
> > > >>> 8/3517300 unfound (0.000%)
> > > >>> 4202 active+clean
> > > >>> 109 active+recovery_wait+degraded
> > > >>> 38 active+undersized+degraded+remapped+wait_backfill
> > > >>> 15 active+remapped+wait_backfill
> > > >>> 11 active+clean+inconsistent
> > > >>> 8 active+recovery_wait+degraded+remapped
> > > >>> 3 active+recovering+undersized+degraded+remapped
> > > >>> 2 active+recovery_wait+undersized+degraded+remapped
> > > >>>
> > > >>> All the pools have size=2 min_size=1.
> > > >>>
> > > >>> (All the unfound objects are on undersized pgs, and I don't seem to
> > > >>> be able to fix them without having replicas (?). They exist, but are
> > > >>> outdated, from an earlier problem.)
> > > >>>
> > > >>> Matyas
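[Editor's note: a minimal sketch of the fix Matyas describes further up in
the thread: marking the unfound objects lost on every PG that reports them.
The enumeration via 'ceph health detail' and the awk pattern are assumptions
about the Jewel output format, not taken from the original mails; since
mark_unfound_lost delete discards data, verify the PG list before running it.]

    #!/bin/sh
    # Find every PG that 'ceph health detail' reports as having unfound objects
    # and mark those objects lost so recovery can complete. Lines are assumed
    # to start with "pg <pgid> ..." and to mention "unfound"; adjust the
    # pattern if your release prints them differently.
    for pg in $(ceph health detail | awk '$1 == "pg" && /unfound/ {print $2}'); do
        echo "marking unfound objects lost in pg $pg"
        ceph pg "$pg" mark_unfound_lost delete
    done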