Hi,
We have a cluster that mixes HDDs and NVMe drives using device classes,
with a dedicated CRUSH rule for each class.
One of our NVMe drives physically died, which caused some of our PGs to
end up stuck in this state:
pg 26.ac is stuck undersized for 60830.991784, current state
activating+undersized+degraded+remapped, last acting
[353,373,368,377,2147483647,350]
pg 26.d1 is stuck undersized for 60830.587711, current state
activating+undersized+degraded+remapped, last acting
[343,2147483647,347,358,366,355]
pg 26.e1 is stuck undersized for 60830.980585, current state
activating+undersized+degraded+remapped, last acting
[340,349,370,2147483647,360,376]
... and so on.
Recovery never started on its own, and we had to manually restart the
affected OSDs for every PG stuck in this state.
The 2^31-1 in there seems to indicate an overflow somewhere. The way we
were able to figure out where exactly was to query the PG and compare
its "up" and "acting" sets: only _one_ of the two had the 2^31-1 value
in place of the correct OSD number. We restarted that OSD, and the PG
started doing its job and recovered.
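In case it is useful to anyone hitting the same thing, below is a rough
sketch of that check as a small script. It is only an illustration: the
script name is made up, and it assumes the ceph CLI is in PATH and that
"ceph pg <pgid> query" prints JSON with top-level "up" and "acting"
lists, which is what we see on our version.

#!/usr/bin/env python3
# find_none_shards.py - compare the "up" and "acting" sets of the given
# PGs and flag shards where only one of the two shows the 2^31-1
# placeholder; the OSD listed in the other set is the one we had to
# restart. Illustrative sketch only.
import json
import subprocess
import sys

NONE_OSD = 2147483647  # value ceph prints for an unmapped shard

def pg_query(pgid):
    """Run `ceph pg <pgid> query` and return the parsed JSON."""
    out = subprocess.check_output(["ceph", "pg", pgid, "query"])
    return json.loads(out)

def main():
    for pgid in sys.argv[1:]:
        q = pg_query(pgid)
        up, acting = q["up"], q["acting"]
        print(f"{pgid}: up={up} acting={acting}")
        # Flag shards where only one of the two sets has the placeholder.
        for shard, (u, a) in enumerate(zip(up, acting)):
            if (u == NONE_OSD) != (a == NONE_OSD):
                osd = a if u == NONE_OSD else u
                print(f"  shard {shard}: up={u} acting={a} -> try restarting osd.{osd}")

if __name__ == "__main__":
    main()

Run as e.g. "./find_none_shards.py 26.ac 26.d1 26.e1"; it prints the
OSD sitting in the slot where the other set shows 2^31-1.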
The issue seems to go back to 2015:
http://lists.ceph.com/pipermail/ceph-users-ceph.com/2015-May/001661.html
but no solution was found there...
I'm more concerned about the cluster not being able to recover (it's a
4+2 EC pool across 12 hosts, so there is plenty of room to heal) than
about the weird print-out.
The VMs that tried to access data in any of the affected PGs of course
died.
Are we missing some setting that would let the cluster self-heal even
for EC pools? This is our first EC pool in production :)
Cheers,
Zoltan