Hi, Cephers!

We have an issue on a Firefly production cluster: after a disk error, one OSD dropped out of the cluster. For about half an hour, XFS async writeback kept trying to commit the XFS journal to the bad disk, and then the whole node went down with "BUG: cpu## soft lockup". We suspect a bug or some strange interaction between the XFS code, the LSI driver and the LSI firmware, but since the cluster is in production, the root-cause investigation will have to wait.

We restarted the node, rejoined 8 of the 10 OSDs to the cluster and watched the recovery process through Monday. One disk was physically dead and had been kicked out of its RAID (we use single-disk RAID0 volumes as OSDs); another lost its RAID metadata, which seems really strange.

The remaining problem is that now, with recovery almost complete, we still have one PG stuck in active+degraded for about 12 hours, and Ceph does not try to recover it. In the PG query we can see only two OSDs. Restarting an OSD makes this PG incomplete for some time, since the pool runs with size 3 and min_size 2, and after the OSD rejoins, the PG returns to active+degraded without any attempt to backfill the third copy.

What can you advise me to do with this PG to complete the recovery?

Result of pg query: http://pastebin.com/krfELqMs

Megov Igor
CIO, Yuterra
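
P.S. To show what we are comparing, here is a rough, untested sketch of how one could pull the PG's up/acting sets and the pool's size/min_size from the ceph CLI via Python. The PG id "3.ea" and pool name "data" are placeholders, not our real values:

#!/usr/bin/env python
# Rough sketch, not tested: summarise where a degraded PG currently lives
# and what replication the pool expects. PG_ID and POOL are placeholders.
import json
import subprocess

PG_ID = "3.ea"   # placeholder: id of the stuck PG
POOL = "data"    # placeholder: pool that PG belongs to

def ceph_json(*args):
    # Run a ceph CLI command and parse its JSON output.
    out = subprocess.check_output(["ceph"] + list(args) + ["--format", "json"])
    return json.loads(out)

pg = ceph_json("pg", PG_ID, "query")
print("state :", pg.get("state"))
print("up    :", pg.get("up"))      # where CRUSH wants the PG mapped
print("acting:", pg.get("acting"))  # which OSDs are serving it right now

size = ceph_json("osd", "pool", "get", POOL, "size")
min_size = ceph_json("osd", "pool", "get", POOL, "min_size")
print("size/min_size:", size.get("size"), "/", min_size.get("min_size"))

If "up" already lists three OSDs while "acting" has only two, CRUSH has picked a third target but backfill is not starting; if "up" itself contains only two OSDs, CRUSH cannot find a third placement at all.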