Hi all,

we are running Ceph 14.2.2 from https://mirror.croit.io/debian-nautilus/ on Debian Buster and are having problems with CephFS. Processes on the mounted file system hang because placement groups get stuck inactive. This often happens after I mark single OSDs out manually. A typical result looks like this:

HEALTH_WARN 1 MDSs report slow metadata IOs; 1 MDSs behind on trimming; Reduced data availability: 4 pgs inactive
MDS_SLOW_METADATA_IO 1 MDSs report slow metadata IOs
    mdsmds1(mds.0): 4 slow metadata IOs are blocked > 30 secs, oldest blocked for 51206 secs
MDS_TRIM 1 MDSs behind on trimming
    mdsmds1(mds.0): Behind on trimming (4298/128) max_segments: 128, num_segments: 4298
PG_AVAILABILITY Reduced data availability: 4 pgs inactive
    pg 21.1f9 is stuck inactive for 52858.655306, current state remapped, last acting [8,2147483647,2147483647,26,27,11]
    pg 21.22f is stuck inactive for 52858.636207, current state remapped, last acting [27,26,4,2147483647,15,2147483647]
    pg 21.2b5 is stuck inactive for 52865.857165, current state remapped, last acting [6,2147483647,21,27,11,2147483647]
    pg 21.3ed is stuck inactive for 52865.852710, current state remapped, last acting [26,18,14,20,2147483647,2147483647]

The affected placement groups belong to an erasure coded pool:

# ceph osd erasure-code-profile get CLAYje4_2_5
crush-device-class=
crush-failure-domain=host
crush-root=default
d=5
k=4
m=2
plugin=clay

Restarting the primary OSD of the stuck PGs brings them back to an active state (the exact commands I use are in the P.S. below).

This problem keeps us from putting the cluster into production. I'm still a beginner with Ceph and the cluster is still in its testing phase. What am I doing wrong? Is this a symptom of using the clay erasure code?

Thanks
Lars
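
P.S. In case the details matter, this is roughly what I do to get a stuck PG active again. PG 21.1f9 and osd.8 are just taken from the health output above as an example; the IDs differ every time, and the systemd unit name is the one from the plain Debian packages:

ceph pg map 21.1f9               # shows the up/acting sets; the first OSD in the acting set is the primary
ceph osd find 8                  # shows which host osd.8 is running on
systemctl restart ceph-osd@8     # restart that OSD daemon on its host

After the restart the PG peers and goes active again.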