Re: Ceph PGs stuck inactive after rebuild node

Zakhar Kirpichenko <zakhar@xxxxxxxxx> · Wed, 6 Apr 2022 18:29:19 +0300

Thanks everyone!

/Zakhar

On Wed, Apr 6, 2022 at 6:24 PM Josh Baergen <jbaergen@xxxxxxxxxxxxxxxx>
wrote:

> For future reference, "ceph pg repeer <pgid>" might have helped here.
>
> Was the PG stuck in the "activating" state? If so, I wonder if you
> temporarily exceeded mon_max_pg_per_osd on some OSDs when rebuilding
> your host. At least on Nautilus I've seen cases where Ceph doesn't
> gracefully recover from this temporary limit violation and the PGs
> need some nudges to become active.
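
As a rough illustration of the "nudge" Josh describes, the sketch below pulls PG IDs out of `ceph pg dump_stuck inactive` output and shows, in comments only, how each could be passed to `ceph pg repeer`. The helper name `stuck_pgids` is made up for this example, and the assumption that every data row starts with a PG ID should be checked against your release's output.

```shell
#!/bin/sh
# Print the PG IDs found in `ceph pg dump_stuck inactive` output read on stdin.
# Assumption: each data row starts with a PG ID such as 32.18; header and
# status lines do not, so they are skipped.
stuck_pgids() {
    awk '$1 ~ /^[0-9]+\.[0-9a-f]+$/ { print $1 }'
}

# Hypothetical usage on an admin node (left commented out on purpose):
#   ceph pg dump_stuck inactive 2>/dev/null | stuck_pgids |
#   while read -r pgid; do
#       ceph pg repeer "$pgid"
#   done
#
# To compare per-OSD PG counts against the limit Josh mentions, the PGS
# column of `ceph osd df` and `ceph config get mon mon_max_pg_per_osd`
# may be a reasonable starting point.
```

Re-peering a handful of PGs one at a time is less disruptive than restarting an OSD, which is presumably why it is worth trying first.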
>
> Josh
>
> On Wed, Apr 6, 2022 at 9:02 AM Eugen Block <eblock@xxxxxx> wrote:
> >
> > Sure, from the output of 'ceph pg map <PG>' you get the acting set,
> > for example:
> >
> > cephadmin:~ # ceph pg map 32.18
> > osdmap e7198 pg 32.18 (32.18) -> up [9,2,1] acting [9,2,1]
> >
> > Then I restarted OSD.9 and the inactive PG became active again.
> > I remember this has been discussed a couple of times in the past on
> > this list, but I'm wondering if this still happens in newer releases.
> > I assume there's no way of preventing that, so we'll probably go with
> > the safe approach on the next node. It's a production cluster and this
> > incident was not expected, of course. At least we got it back online.
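
The step Eugen describes (read the acting set, then restart its first OSD) can be sketched as below. The helper name `acting_primary` is hypothetical. It assumes the first entry of the acting set is a real OSD ID and is the primary, as in the 32.18 example above. `ceph pg <pgid> query` is a more authoritative source for the primary.

```shell
#!/bin/sh
# Read one line of `ceph pg map <pgid>` output on stdin and print the first
# OSD ID of the acting set.
# Assumption: the line contains "acting [a,b,c]" with a numeric first entry.
acting_primary() {
    sed -n 's/.*acting \[\([0-9][0-9]*\).*/\1/p'
}

# Sample line taken from the message above.
echo 'osdmap e7198 pg 32.18 (32.18) -> up [9,2,1] acting [9,2,1]' | acting_primary

# Hypothetical follow-up, run on the host that carries the OSD and assuming
# a package-based (non-containerized) systemd deployment:
#   systemctl restart ceph-osd@9
```

With the sample line, the function prints `9`, matching the OSD that was restarted in this thread.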
> >
> >
> > Quoting Zakhar Kirpichenko <zakhar@xxxxxxxxx>:
> >
> > > Hi Eugen,
> > >
> > > > Can you please elaborate on what you mean by "restarting the
> > > > primary PG"?
> > >
> > > Best regards,
> > > Zakhar
> > >
> > > On Wed, Apr 6, 2022 at 5:15 PM Eugen Block <eblock@xxxxxx> wrote:
> > >
> > >> Update: Restarting the primary PG helped to bring the PGs back to
> > >> active state. Consider this thread closed.
> > >>
> > >>
> > > >> Quoting Eugen Block <eblock@xxxxxx>:
> > >>
> > >> > Hi all,
> > >> >
> > > >> > I have a strange situation here: a Nautilus cluster with two DCs,
> > > >> > where the main pool is an EC pool with k=7, m=11, min_size = 8
> > > >> > (failure domain host). We have confirmed failure resiliency
> > > >> > multiple times for this cluster. Today we rebuilt one node, which
> > > >> > left us with 34 inactive PGs, and I'm wondering why they are
> > > >> > inactive. It's quite urgent and I'd like to get the PGs active
> > > >> > again. We didn't drain the node before rebuilding it, but this
> > > >> > procedure has worked multiple times in the past.
> > > >> > I haven't done too much damage yet, except for trying to force the
> > > >> > backfill of one PG (ceph pg force-backfill <PG>), to no avail so far.
> > > >> > Any pointers are highly appreciated!
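
To show why inactive PGs are surprising here, the arithmetic behind the pool settings can be worked through as below. This reads "k7 m11" as k=7, m=11. The function name `ec_tolerance` is made up for the example.

```shell
#!/bin/sh
# Print the total number of shards per PG and how many of them can be
# unavailable while the PG still meets min_size.
ec_tolerance() {
    k=$1
    m=$2
    min_size=$3
    total=$((k + m))
    echo "$total $((total - min_size))"
}

ec_tolerance 7 11 8   # prints "18 10"
```

With failure domain host, each of the 18 shards sits on a different host, so a single rebuilt node should cost each PG at most one shard, far below the 10 the settings allow.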
> > >> >
> > >> > Regards,
> > >> > Eugen
> > >>
> > >>
> > >>
> > >> _______________________________________________
> > >> ceph-users mailing list -- ceph-users@xxxxxxx
> > >> To unsubscribe send an email to ceph-users-leave@xxxxxxx
> > >>