Yes, we already did that, but since the OSD does not exist anymore we get the following error:

% ceph osd lost 81 --yes-i-really-mean-it
Error ENOENT: osd.81 does not exist

So we do not know how we can bring the PGs to notice that OSD 81 does not exist anymore...

On Thu, Aug 20, 2020 at 11:41 AM Dan van der Ster <dan@xxxxxxxxxxxxxx> wrote:
>
> Did you already mark osd.81 as lost?
>
> AFAIU you need to `ceph osd lost 81`, and *then* you can try the
> osd_find_best_info_ignore_history_les option.
>
> -- dan
>
> On Thu, Aug 20, 2020 at 11:31 AM Martin Palma <martin@xxxxxxxx> wrote:
> >
> > All inactive and incomplete PGs are blocked by OSD 81, which does not
> > exist anymore:
> >
> > ...
> >     "down_osds_we_would_probe": [
> >         81
> >     ],
> >     "peering_blocked_by": [],
> >     "peering_blocked_by_detail": [
> >         {
> >             "detail": "peering_blocked_by_history_les_bound"
> >         }
> >     ]
> > ...
> >
> > Here is the full output: https://pastebin.com/V5EPZ0N7
> >
> > On Thu, Aug 20, 2020 at 10:58 AM Dan van der Ster <dan@xxxxxxxxxxxxxx> wrote:
> > >
> > > Something else to help debugging is
> > >
> > > ceph pg 17.173 query
> > >
> > > At the end it should say why the PG is incomplete.
> > >
> > > -- dan
> > >
> > > On Thu, Aug 20, 2020 at 10:01 AM Eugen Block <eblock@xxxxxx> wrote:
> > > >
> > > > Hi Martin,
> > > >
> > > > have you seen this blog post [1]? It describes how to recover from
> > > > inactive and incomplete PGs (on a size 1 pool). I haven't tried any of
> > > > that, but it could be worth a try. Apparently it would only work if the
> > > > affected PGs have 0 objects, but that seems to be the case, right?
> > > >
> > > > Regards,
> > > > Eugen
> > > >
> > > > [1] https://medium.com/opsops/recovering-ceph-from-reduced-data-availability-3-pgs-inactive-3-pgs-incomplete-b97cbcb4b5a1
> > > >
> > > > Quoting Martin Palma <martin@xxxxxxxx>:
> > > >
> > > > > If Ceph consultants are reading this, please feel free to contact me
> > > > > off list. We are looking for someone who can help us; of course we will
> > > > > pay.
> > > > >
> > > > > On Mon, Aug 17, 2020 at 12:50 PM Martin Palma <martin@xxxxxxxx> wrote:
> > > > >>
> > > > >> After doing some research, I suspect the problem is that an OSD was
> > > > >> removed while the cluster was backfilling.
> > > > >>
> > > > >> Now the PGs which are inactive and incomplete all have the same
> > > > >> (removed) OSD in their "down_osds_we_would_probe" output, and peering
> > > > >> is blocked by "peering_blocked_by_history_les_bound". We tried to set
> > > > >> "osd_find_best_info_ignore_history_les = true", but with no success;
> > > > >> the OSDs stay in a peering loop.
> > > > >>
> > > > >> On Mon, Aug 17, 2020 at 9:53 AM Martin Palma <martin@xxxxxxxx> wrote:
> > > > >> >
> > > > >> > Here is the output with all OSDs up and running.
> > > > >> >
> > > > >> > ceph -s: https://pastebin.com/5tMf12Lm
> > > > >> > ceph health detail: https://pastebin.com/avDhcJt0
> > > > >> > ceph osd tree: https://pastebin.com/XEB0eUbk
> > > > >> > ceph osd pool ls detail: https://pastebin.com/ShSdmM5a
> > > > >> >
> > > > >> > On Mon, Aug 17, 2020 at 9:38 AM Martin Palma <martin@xxxxxxxx> wrote:
> > > > >> > >
> > > > >> > > Hi Peter,
> > > > >> > >
> > > > >> > > On the weekend another host was down due to power problems and was
> > > > >> > > restarted. Therefore these outputs also include some "Degraded data
> > > > >> > > redundancy" messages. And one OSD crashed due to a disk error.
> > > > >> > >
> > > > >> > > ceph -s: https://pastebin.com/Tm8QHp52
> > > > >> > > ceph health detail: https://pastebin.com/SrA7Bivj
> > > > >> > > ceph osd tree: https://pastebin.com/nBK8Uafd
> > > > >> > > ceph osd pool ls detail: https://pastebin.com/kYyCb7B2
> > > > >> > >
> > > > >> > > No, it's not an EC pool which has the inactive+incomplete PGs.
> > > > >> > >
> > > > >> > > ceph osd crush dump | jq '[.rules, .tunables]': https://pastebin.com/gqDTjfat
> > > > >> > >
> > > > >> > > Best,
> > > > >> > > Martin
> > > > >> > >
> > > > >> > > On Sun, Aug 16, 2020 at 3:44 PM Peter Maloney
> > > > >> > > <peter.maloney@xxxxxxxxxxxxxxxxxxxx> wrote:
> > > > >> > > >
> > > > >> > > > Dear Martin,
> > > > >> > > >
> > > > >> > > > Can you provide some details?
> > > > >> > > >
> > > > >> > > > ceph -s
> > > > >> > > > ceph health detail
> > > > >> > > > ceph osd tree
> > > > >> > > > ceph osd pool ls detail
> > > > >> > > >
> > > > >> > > > If it's EC (you implied it's not), also show the crush rules...
> > > > >> > > > and you may as well include the tunables (because greatly raising
> > > > >> > > > choose_total_tries, e.g. to 200, may be the solution to your problem):
> > > > >> > > >
> > > > >> > > > ceph osd crush dump | jq '[.rules, .tunables]'
> > > > >> > > >
> > > > >> > > > Peter
> > > > >> > > >
> > > > >> > > > On 8/16/20 1:18 AM, Martin Palma wrote:
> > > > >> > > > > Yes, but that didn't help. After some time they have blocked
> > > > >> > > > > requests again and remain inactive and incomplete.
> > > > >> > > > >
> > > > >> > > > > On Sat, 15 Aug 2020 at 16:58, <ceph@xxxxxxxxxx> wrote:
> > > > >> > > > >
> > > > >> > > > >> Did you try to restart the said OSDs?
> > > > >> > > > >>
> > > > >> > > > >> Hth
> > > > >> > > > >>
> > > > >> > > > >> Mehmet
> > > > >> > > > >>
> > > > >> > > > >> On 12 August 2020 at 21:07:55 MESZ, Martin Palma <martin@xxxxxxxx> wrote:
> > > > >> > > > >>
> > > > >> > > > >>>> Are the OSDs online? Or do they refuse to boot?
> > > > >> > > > >>> Yes. They are up and running and not marked as down or out of the
> > > > >> > > > >>> cluster.
> > > > >> > > > >>>> Can you list the data with ceph-objectstore-tool on these OSDs?
> > > > >> > > > >>> If you mean the "list" operation on the PG: it works and gives output,
> > > > >> > > > >>> for example:
> > > > >> > > > >>>
> > > > >> > > > >>> $ ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-63 --pgid 22.11a --op list
> > > > >> > > > >>> ["22.11a",{"oid":"1001c1ee04f.00000007","key":"","snapid":-2,"hash":3825189146,"max":0,"pool":22,"namespace":"","max":0}]
> > > > >> > > > >>> ["22.11a",{"oid":"1000448667f.00000000","key":"","snapid":-2,"hash":4294951194,"max":0,"pool":22,"namespace":"","max":0}]
> > > > >> > > > >>> ...
> > > > >> > > > >>>
> > > > >> > > > >>> If I run "ceph pg ls incomplete", in the output only one PG has
> > > > >> > > > >>> objects... all others have 0 objects.
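
A note on Peter's choose_total_tries hint above: the usual way to raise that tunable is to edit a decompiled copy of the CRUSH map and inject it back. This is only a rough sketch, not something verified against this cluster; the file names are arbitrary, the value 200 is simply Peter's example, and changing tunables triggers re-peering and possibly data movement, so dry-run the new map with crushtool --test first.

  # grab and decompile the current CRUSH map
  ceph osd getcrushmap -o crush.bin
  crushtool -d crush.bin -o crush.txt

  # in crush.txt raise the tunable, e.g.:
  #   tunable choose_total_tries 200

  # recompile, sanity-check the mappings, then inject it
  crushtool -c crush.txt -o crush.new
  crushtool -i crush.new --test --show-bad-mappings
  ceph osd setcrushmap -i crush.new
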
> > > > >> > > > --
> > > > >> > > > --------------------------------------------
> > > > >> > > > Peter Maloney
> > > > >> > > > Brockmann Consult GmbH
> > > > >> > > > www.brockmann-consult.de
> > > > >> > > > Chrysanderstr. 1
> > > > >> > > > D-21029 Hamburg, Germany
> > > > >> > > > Tel: +49 (0)40 69 63 89 - 320
> > > > >> > > > E-mail: peter.maloney@xxxxxxxxxxxxxxxxxxxx
> > > > >> > > > Amtsgericht Hamburg HRB 157689
> > > > >> > > > Geschäftsführer Dr. Carsten Brockmann
> > > > >> > > > --------------------------------------------

_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx
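
Coming back to the question at the top of the thread (ceph osd lost 81 returning ENOENT because osd.81 was already removed from the OSD map): the following is only a sketch of approaches that have been used in comparable situations, not verified advice for this cluster. Exact commands and guard flags differ between Ceph releases, the ids 81, osd.63 and PG 22.11a are taken from the thread, <pgid> is a placeholder, and any PG touched this way should first be exported as a backup.

  # 0) back up an affected PG from a surviving OSD (the OSD must be stopped)
  ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-63 --pgid 22.11a --op export --file /root/22.11a.export

  # 1) osd_find_best_info_ignore_history_les only helps if it is active on the
  #    acting OSDs of the affected PGs while they re-peer, e.g.
  ceph tell osd.63 injectargs '--osd_find_best_info_ignore_history_les=true'
  # (or set it in ceph.conf for those OSDs and restart them)

  # 2) some operators re-create the removed id in the OSD map so that it can be
  #    marked lost and then purged again (Luminous or later)
  ceph osd new $(uuidgen) 81
  ceph osd lost 81 --yes-i-really-mean-it
  ceph osd purge 81 --yes-i-really-mean-it

  # 3) last resort for incomplete PGs that hold 0 objects: recreate them empty,
  #    accepting that anything still in them is lost
  ceph osd force-create-pg <pgid>
  # (newer releases additionally require --yes-i-really-mean-it)

Some reports combine step 1 with stopping the acting primary and running ceph-objectstore-tool --op mark-complete on the affected PG, but that should only be attempted after taking exports as in step 0.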