Re: One PG stuck in active+clean+remapped

Dan van der Ster <dvanders@xxxxxxxxx> · Thu, 24 Feb 2022 20:25:25 +0100

Hi Erwin,

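This may be one of the rare cases where the default choose_total_tries
= 50 is too low: CRUSH gives up before it finds a usable OSD for the
third replica, so the up set ends up with only two OSDs.
You can try increasing it to 75 or 100 and see if CRUSH can find 3 up OSDs.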
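
To check whether the up set is complete, before and after the change, something like the sketch below may help. It assumes the `ceph pg map <pgid>` output has the usual one-line shape shown in the comment; the function name, the PG id and the epoch are made up for illustration.

```shell
# Count the OSDs in the "up" set of a `ceph pg map` line read from stdin.
# Assumed input shape (pgid and epoch here are hypothetical):
#   osdmap e123 pg 1.2f (1.2f) -> up [34,42] acting [34,42,38]
count_up_osds() {
  # Extract the comma-separated list between "up [" and "] acting",
  # put one OSD id per line, then count the lines that contain a digit.
  sed -n 's/.* up \[\([0-9,]*\)\] acting .*/\1/p' | tr ',' '\n' | grep -c '[0-9]'
}
```

It could be used as `ceph pg map <pgid> | count_up_osds`. For a 3-replica pool, anything below 3 means CRUSH did not produce a full mapping for that PG.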

Here's the basic recipe:

# ceph osd getcrushmap -o crush.map
# crushtool -d crush.map -o crush.txt
# vi crush.txt  # change the tunable line to: tunable choose_total_tries 100
# crushtool -c crush.txt -o crush.map2
# ceph osd setcrushmap -i crush.map2
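
If you would rather script the edit step than open an editor, here is a minimal sketch. It assumes the decompiled map already contains a line starting with "tunable choose_total_tries"; the function name is hypothetical.

```shell
# Set "tunable choose_total_tries" in a decompiled CRUSH map (text file).
# Usage: set_choose_total_tries crush.txt 100
set_choose_total_tries() {
  map_txt=$1
  tries=$2
  if grep -q '^tunable choose_total_tries ' "$map_txt"; then
    # Rewrite the existing tunable line in place; sed keeps the
    # original file as "$map_txt.bak".
    sed -i.bak "s/^tunable choose_total_tries .*/tunable choose_total_tries $tries/" "$map_txt"
  else
    # Do not guess where to add the line; leave the file untouched.
    echo "no choose_total_tries line found in $map_txt" >&2
    return 1
  fi
}
```

Before injecting the new map you can also test it offline, for example with `crushtool -i crush.map2 --test --show-bad-mappings --rule <rule id> --num-rep 3`, where the rule id is the one used by your pool. If that prints no bad mappings, CRUSH found 3 OSDs for every input it tried.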

Cheers, dan

On Thu, Feb 24, 2022 at 6:29 PM Erwin Lubbers <erwin@xxxxxxxxxxx> wrote:
>
> Hi all,
>
> I have one active+clean+remapped PG on a 152-OSD Octopus (15.2.15) cluster with evenly balanced OSDs (around 40% usage). The cluster has three replicas spread across three datacenters (A+B+C).
>
> All PGs have a replica in each datacenter (as defined in the crush map), except this one (in a pool containing 2048 PGs), which is up on OSD.34 and OSD.42 and acting on OSD.34, OSD.42 and OSD.38.
>
> OSD.34 is located in datacenter A, OSD.42 in B and OSD.38 in A again, but the third replica should be in C.
>
> I restarted all OSDs, monitors, managers and servers. I marked the OSDs that the PG is acting on out and brought them back in a minute later. In all cases the PG returns to the same state after backfilling, except that one of the A replicas moves to another OSD in datacenter A. I also turned the balancer off and on. Nothing seems to bring the PG back to active+clean.
>
> Any suggestions?
>
> Regards,
> Erwin
> _______________________________________________
> ceph-users mailing list -- ceph-users@xxxxxxx
> To unsubscribe send an email to ceph-users-leave@xxxxxxx