On Fri, Nov 22, 2019 at 9:33 PM Zoltan Arnold Nagy <zoltan@xxxxxxxxxxxxxxxxxx> wrote:
> The 2^31-1 in there seems to indicate an overflow somewhere - the way we
> were able to figure out where exactly was to query the PG and compare the
> "up" and "acting" sets - only _one_ of them had the 2^31-1 number in place
> of the correct OSD number. We restarted that OSD and the PG started doing
> its job and recovered.

No, this value is intentional (it shows up as 'None' in higher-level tools):
it means no mapping could be found for that slot. Check your CRUSH map and
CRUSH rule.

Paul

> The issue seems to go back to 2015:
> http://lists.ceph.com/pipermail/ceph-users-ceph.com/2015-May/001661.html
> but no solution was posted.
>
> I'm more concerned about the cluster not being able to recover (it's a
> 4+2 EC pool across 12 hosts - plenty of room to heal) than about the
> weird print-out.
>
> The VMs that wanted to access data in any of the affected PGs of course
> died.
>
> Are we missing some settings to let the cluster self-heal even for EC
> pools? First EC pool in production :)
>
> Cheers,
> Zoltan
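
P.S. For reference, 2^31-1 is 2147483647, the placeholder CRUSH returns when
it cannot fill a slot (CRUSH_ITEM_NONE in the source), which higher-level
tools render as 'None'. Below is a minimal sketch of how one could scan for
PGs that still carry such unmapped slots by parsing `ceph pg dump pgs` JSON
output; the exact JSON layout differs between releases, so the field handling
is an assumption, not a definitive implementation.

#!/usr/bin/env python3
# Sketch: list PGs whose "up" or "acting" set contains the "no mapping"
# placeholder (2^31-1). Assumes the `ceph` CLI is available on this host.
import json
import subprocess

NO_MAPPING = 2**31 - 1  # 2147483647, rendered as 'None' by higher-level tools


def pg_stats():
    out = subprocess.check_output(
        ["ceph", "pg", "dump", "pgs", "--format", "json"])
    data = json.loads(out)
    # Depending on the release, the PG list may be the top-level value
    # or nested under a "pg_stats" key, so handle both defensively.
    if isinstance(data, dict):
        return data.get("pg_stats", [])
    return data


def main():
    for pg in pg_stats():
        for field in ("up", "acting"):
            osds = pg.get(field, [])
            if NO_MAPPING in osds:
                print(f"{pg['pgid']}: {field} set has unmapped slot(s): {osds}")


if __name__ == "__main__":
    main()

If this still reports PGs after you have fixed the CRUSH map/rule, the
remaining ones are the PGs to look at more closely.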