Hi,
We have a cluster that mixes HDDs and NVMe drives using device classes,
with a dedicated CRUSH rule for each class.
One of our NVMe drives physically died, which caused some of our PGs to
end up stuck in this state:
pg 26.ac is stuck undersized for 60830.991784, current state
activating+undersized+degraded+remapped, last acting
[353,373,368,377,2147483647,350]
pg 26.d1 is stuck undersized for 60830.587711, current state
activating+undersized+degraded+remapped, last acting
[343,2147483647,347,358,366,355]
pg 26.e1 is stuck undersized for 60830.980585, current state
activating+undersized+degraded+remapped, last acting
[340,349,370,2147483647,360,376]
... and so on.
Recovery never started on its own, and we had to manually restart the
affected OSDs for every PG stuck in this state.
The 2^31-1 in there seems to indicate an overflow somewhere. The way we
were able to figure out where exactly was to query the PG and compare
its "up" and "acting" sets: only _one_ of the two had the 2^31-1 value
in place of the correct OSD number. We restarted that OSD, and the PG
started doing its job and recovered.
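In case it is useful to anyone hitting the same thing, below is a rough
sketch of that check as a small script. It is only an illustration: the
script name is made up, and it assumes the ceph CLI is in PATH and that
"ceph pg <pgid> query" prints JSON with top-level "up" and "acting"
lists, which is what we see on our version.

#!/usr/bin/env python3
# find_none_shards.py - compare the "up" and "acting" sets of the given
# PGs and flag shards where only one of the two shows the 2^31-1
# placeholder; the OSD listed in the other set is the one we had to
# restart. Illustrative sketch only.
import json
import subprocess
import sys

NONE_OSD = 2147483647  # value ceph prints for an unmapped shard

def pg_query(pgid):
    """Run `ceph pg <pgid> query` and return the parsed JSON."""
    out = subprocess.check_output(["ceph", "pg", pgid, "query"])
    return json.loads(out)

def main():
    for pgid in sys.argv[1:]:
        q = pg_query(pgid)
        up, acting = q["up"], q["acting"]
        print(f"{pgid}: up={up} acting={acting}")
        # Flag shards where only one of the two sets has the placeholder.
        for shard, (u, a) in enumerate(zip(up, acting)):
            if (u == NONE_OSD) != (a == NONE_OSD):
                osd = a if u == NONE_OSD else u
                print(f"  shard {shard}: up={u} acting={a} -> try restarting osd.{osd}")

if __name__ == "__main__":
    main()

Run as e.g. "./find_none_shards.py 26.ac 26.d1 26.e1"; it prints the
OSD sitting in the slot where the other set shows 2^31-1.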
The issue seems to go back to 2015:
http://lists.ceph.com/pipermail/ceph-users-ceph.com/2015-May/001661.html
but no solution was found there...
I'm more concerned about the cluster not being able to recover (it's a
4+2 EC pool across 12 hosts, so there is plenty of room to heal) than
about the weird print-out.
The VMs that tried to access data in any of the affected PGs of course
died.
Are we missing some setting that would let the cluster self-heal even
for EC pools? This is our first EC pool in production :)
Cheers,
Zoltan