Re: EC PGs stuck activating, 2^31-1 as OSD ID, automatic recovery not kicking in

On 2019-11-22 21:45, Paul Emmerich wrote:
> On Fri, Nov 22, 2019 at 9:33 PM Zoltan Arnold Nagy
> <zoltan@xxxxxxxxxxxxxxxxxx> wrote:

>> The 2^31-1 in there seems to indicate an overflow somewhere - the way
>> we were able to figure out where exactly was to query the PG and
>> compare the "up" and "acting" sets - only _one_ of them had the 2^31-1
>> number in place of the correct OSD number. We restarted that OSD and
>> the PG started doing its job and recovered.

> no, this value is intentional (and shows up as 'None' in higher-level
> tools); it means no mapping could be found

thanks, didn't know.

> check your crush map and crush rule

if it were indeed a crush rule or map issue, it would not have been
resolved by just restarting the primary OSD of the PG, would it?
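
For anyone searching the archives later, what we did boils down to
something like this (the PG ID and OSD number below are placeholders,
not our real values, and the restart line assumes a systemd deployment):

    # show the up and acting sets of the stuck PG
    ceph pg map 2.1a
    # full JSON detail; one of "up"/"acting" carried 2147483647 (2^31-1)
    ceph pg 2.1a query
    # restart the primary OSD of that PG on the host that runs it
    systemctl restart ceph-osd@17

After restarting the primary, the PG peered and recovered on its own.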

the crush rule comes from the erasure-code profile, which was created by
running

ceph osd erasure-code-profile set ec42 k=4 m=2 crush-device-class=nvme

where the default failure domain is host; as I said, we have 12 hosts,
so I don't see anything wrong here - it's all default...
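
For completeness, this is roughly how one can double-check that the
profile and the rule look sane (the rule and pool names below are
placeholders; the EC rule itself only gets created once a pool is built
from the profile):

    # show the profile; crush-failure-domain defaults to host
    ceph osd erasure-code-profile get ec42
    # list the crush rules and dump the one the EC pool uses
    ceph osd crush rule ls
    ceph osd crush rule dump <rule-name>
    # confirm which rule the pool is actually mapped with
    ceph osd pool get <pool-name> crush_rule

With k=4, m=2 and a failure domain of host, each PG needs 6 distinct
hosts, which our 12 hosts satisfy with plenty of room.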

this is why I suspect a bug; I just don't have any evidence other than
that it happened to us :)
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx


