EC PGs stuck activating, 2^31-1 as OSD ID, automatic recovery not kicking in

Hi,

We have a cluster where we mix HDDs and NVMe drives using device classes, with a specific CRUSH rule for each class.

One of our NVMe drives physically died, which caused some of our PGs to go into this state:

pg 26.ac is stuck undersized for 60830.991784, current state activating+undersized+degraded+remapped, last acting [353,373,368,377,2147483647,350]
pg 26.d1 is stuck undersized for 60830.587711, current state activating+undersized+degraded+remapped, last acting [343,2147483647,347,358,366,355]
pg 26.e1 is stuck undersized for 60830.980585, current state activating+undersized+degraded+remapped, last acting [340,349,370,2147483647,360,376]
... and so on.
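
For anyone who needs to sift through a long list of such entries, here is a minimal Python sketch that pulls out each PG ID and the position in the acting set that holds 2147483647. The line layout is assumed from the three entries quoted above, and the names NO_OSD, STUCK_RE and parse_stuck_pgs are made up for this illustration; they are not part of any Ceph tooling.

```python
import re

# 2^31-1, the value printed in place of an OSD ID in the output above.
NO_OSD = 2**31 - 1

# Matches one "pg ... is stuck ..." entry; layout assumed from the quoted lines.
STUCK_RE = re.compile(
    r"pg (?P<pgid>\S+) is stuck (?P<why>\w+) for (?P<secs>[\d.]+), "
    r"current state (?P<state>[^,\s]+), "
    r"last acting \[(?P<acting>[\d,]+)\]"
)


def parse_stuck_pgs(text):
    """Return one dict per stuck PG, including shard positions with no OSD."""
    pgs = []
    for m in STUCK_RE.finditer(text):
        acting = [int(x) for x in m.group("acting").split(",")]
        pgs.append({
            "pgid": m.group("pgid"),
            "state": m.group("state"),
            "acting": acting,
            # Indexes in the acting set where the placeholder value appears.
            "missing_shards": [i for i, osd in enumerate(acting) if osd == NO_OSD],
        })
    return pgs


sample = """\
pg 26.ac is stuck undersized for 60830.991784, current state activating+undersized+degraded+remapped, last acting [353,373,368,377,2147483647,350]
pg 26.d1 is stuck undersized for 60830.587711, current state activating+undersized+degraded+remapped, last acting [343,2147483647,347,358,366,355]
pg 26.e1 is stuck undersized for 60830.980585, current state activating+undersized+degraded+remapped, last acting [340,349,370,2147483647,360,376]
"""

stuck = parse_stuck_pgs(sample)
for pg in stuck:
    print(pg["pgid"], pg["missing_shards"])
```

Each PG in the sample has exactly one such position, which matches a single dead drive taking out one shard of each affected PG.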

Recovery never happened and we had to manually restart all affected OSDs for all PGs stuck in such a state.

The 2^31-1 in there seems to indicate an overflow somewhere. We were able to figure out which OSD was affected by querying the PG and comparing the "up" and "acting" sets: only _one_ of them had the 2^31-1 number in place of the correct OSD number. We restarted that OSD and the PG started doing its job and recovered.
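
The comparison described above can be sketched roughly as follows in Python. It assumes the JSON printed by "ceph pg 26.ac query" has top-level "up" and "acting" lists. The "up" set and the OSD number 351 in the sample are made up for illustration (only the acting set comes from the output quoted earlier), and restart_candidates is a hypothetical helper name, not a Ceph API. As far as I can tell, 2147483647 is also the value the CRUSH code uses for "no OSD mapped to this slot" (CRUSH_ITEM_NONE, 0x7fffffff), so it may be a placeholder rather than an overflow.

```python
import json

# 2^31-1, printed where an OSD ID is expected.
NO_OSD = 2**31 - 1


def restart_candidates(pg_query):
    """Return (slot, osd_id) pairs where exactly one of "up"/"acting"
    holds the placeholder and the other holds a real OSD ID."""
    up, acting = pg_query["up"], pg_query["acting"]
    candidates = []
    for slot, (u, a) in enumerate(zip(up, acting)):
        # True only when one side is the placeholder and the other is not.
        if (u == NO_OSD) != (a == NO_OSD):
            candidates.append((slot, a if u == NO_OSD else u))
    return candidates


# Hypothetical, heavily trimmed query output; real output has many more fields.
raw = """
{
  "state": "activating+undersized+degraded+remapped",
  "up":     [353, 373, 368, 377, 351, 350],
  "acting": [353, 373, 368, 377, 2147483647, 350]
}
"""

query = json.loads(raw)
print(restart_candidates(query))
```

In practice the JSON would come from the command itself, for example by redirecting "ceph pg 26.ac query" to a file and loading that instead of the inline string.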

The issue seems to go back to 2015, but that thread has no solution:
http://lists.ceph.com/pipermail/ceph-users-ceph.com/2015-May/001661.html

I'm more concerned about the cluster not being able to recover (it's a 4+2 EC pool across 12 hosts - plenty of room to heal) than about the weird print-out.

The VMs that tried to access data in any of the affected PGs of course died.

Are we missing some settings to let the cluster self-heal even for EC pools? First EC pool in production :)

Cheers,
Zoltan
