Degraded PG doesn't recover properly

Hi, Cephers!

We have an issue on a Firefly production cluster: after a disk error, one OSD was dropped out of
the cluster. For about half an hour, XFS async writeback kept trying to commit the XFS journal to the
bad disk, and then the whole node went down with "BUG: cpu## soft lockup". We suspect a bug or an odd
interaction between the XFS code, the LSI driver, and the LSI firmware, but since the cluster is in
production, the root-cause investigation will have to wait.
We restarted the node, rejoined 8 of the 10 OSDs to the cluster, and watched the recovery process
during Monday. One disk was physically dead and had been kicked out of its RAID (we use single-disk
RAID0 volumes as OSDs), and another had lost its RAID metadata, which seems really strange.
But the issue is that now, with recovery almost complete, we still have one PG stuck in active+degraded
state for about 12 hours, and Ceph does not try to recover it. In the PG query we can see only two OSDs.
Restarting an OSD makes this PG incomplete for some time, since the pool runs with size 3 and min_size 2,
and after the OSD rejoins, the PG returns to the active+degraded state without any attempt to
backfill to 3 copies. What can you advise me to do with this PG to complete recovery?
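
For context, this is roughly how I have been looking at the PG; the pool name, PG id, and OSD id
below are placeholders, and the restart line assumes the stock Firefly sysvinit/upstart scripts:

    # overall health and the list of stuck PGs
    ceph health detail
    ceph pg dump_stuck unclean

    # details of the degraded PG (full output is in the pastebin below)
    ceph pg <pgid> query

    # confirm the pool replication settings
    ceph osd pool get <poolname> size
    ceph osd pool get <poolname> min_size

    # restarting one of the acting OSDs only makes the PG go incomplete for a while
    service ceph restart osd.<id>
    ceph -w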

Result of the pg query:

http://pastebin.com/krfELqMs

Megov Igor
CIO, Yuterra