Re: Ceph PGs stuck inactive after rebuild node

For future reference, "ceph pg repeer <pgid>" might have helped here.
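
For example (illustrative only, substitute the actual PG id):

# show what the PG is waiting for before poking it
ceph pg <pgid> query
# ask the acting set to go through peering again
ceph pg repeer <pgid>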

Was the PG stuck in the "activating" state? If so, I wonder if you
temporarily exceeded mon_max_pg_per_osd on some OSDs when rebuilding
your host. At least on Nautilus I've seen cases where Ceph doesn't
gracefully recover from this temporary limit violation and the PGs
need some nudges to become active.
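
A quick way to check whether that limit was involved (commands from memory,
so double-check them against your release):

# effective limit (default 250, IIRC)
ceph config get mon mon_max_pg_per_osd
# the PGS column shows how many PGs each OSD currently holds
ceph osd df tree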

Josh

On Wed, Apr 6, 2022 at 9:02 AM Eugen Block <eblock@xxxxxx> wrote:
>
> Sure, from the output of 'ceph pg map <PG>' you get the acting set,
> for example:
>
> cephadmin:~ # ceph pg map 32.18
> osdmap e7198 pg 32.18 (32.18) -> up [9,2,1] acting [9,2,1]
>
> Then I restarted OSD.9 and the inactive PG became active again.
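> (On a systemd-based deployment that's typically just a
> 'systemctl restart ceph-osd@<id>' on the respective host.)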
> I remember this has been discussed a couple of times in the past on
> this list, but I'm wondering if this still happens in newer releases.
> I assume there's no way of preventing that, so we'll probably go with
> the safer approach and drain the next node first. It's a production
> cluster and this incident was unexpected, of course. At least we got
> it back online.
>
>
> Zitat von Zakhar Kirpichenko <zakhar@xxxxxxxxx>:
>
> > Hi Eugen,
> >
> > Can you please elaborate on what you mean by "restarting the primary PG"?
> >
> > Best regards,
> > Zakhar
> >
> > On Wed, Apr 6, 2022 at 5:15 PM Eugen Block <eblock@xxxxxx> wrote:
> >
> >> Update: Restarting the primary PG helped to bring the PGs back to
> >> active state. Consider this thread closed.
> >>
> >>
> >> Zitat von Eugen Block <eblock@xxxxxx>:
> >>
> >> > Hi all,
> >> >
> >> > I have a strange situation here: a Nautilus cluster with two DCs,
> >> > where the main pool is an EC pool with k=7 m=11, min_size = 8
> >> > (failure domain host). We have confirmed failure resiliency
> >> > multiple times for this cluster, but today we rebuilt one node,
> >> > which currently leaves 34 PGs inactive. I'm wondering why they are
> >> > inactive, though. It's quite urgent and I'd like to get the PGs
> >> > active again. We didn't drain the node before rebuilding it, but
> >> > this procedure has worked multiple times in the past.
> >> > I haven't done too much damage yet, except for trying to force the
> >> > backfill of one PG (ceph pg force-backfill <PG>), so far to no avail.
> >> > Any pointers are highly appreciated!
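> >> > (I can provide 'ceph pg dump_stuck inactive' and 'ceph pg <PG> query'
> >> > output if that helps.)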
> >> >
> >> > Regards,
> >> > Eugen
> >>
> >>
> >>
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx


