Summary:
This is a production CephFS cluster. An OSD node crashed, and the cluster rebalanced successfully. I brought the down node back online, and everything has rebalanced except for 1 hung pg; MDS trimming is now falling behind. No hardware failures have become apparent yet.
Questions:
1) Is there a way to see what pool a placement group belongs to?
2) How should I move forward with unsticking my 1 pg in a constant remapped+peering state?
Based on the remapped+peering pg not going away and MDS trimming getting further and further behind, I'm guessing the pg belongs to the cephfs metadata pool.
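If I understand it correctly, the pool ID is the numeric prefix of the pg id, so pg 1.efa should belong to pool 1, and something like the following (run on the monitor) should map that ID back to a pool name; please correct me if that's the wrong way to check:

[brady@mon0 ~]$ ceph osd lspools
[brady@mon0 ~]$ ceph df

If pool 1 turns out to be the cephfs metadata pool, that would line up with the trimming problem.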
Any help you can provide is greatly appreciated.
Details:
OSD Node Description:
-2 vlans going over 40gig ethernet for pub/priv nets
-256 GB RAM
-2x Xeon 2660v4
-2x P3700 (journal)
-24x OSD
The primary monitor runs on a dedicated node with a similar configuration to the OSD nodes.
The primary MDS runs on a dedicated node with a similar configuration to the OSD nodes.
[brady@mon0 ~]$ ceph health detail
HEALTH_ERR 1 pgs are stuck inactive for more than 300 seconds; 1 pgs peering; 1 pgs stuck inactive; 47 requests are blocked > 32 sec; 1 osds have slow requests; mds0: Behind on trimming (76/30)
pg 1.efa is stuck inactive for 174870.396769, current state remapped+peering, last acting [153,162,5]
pg 1.efa is remapped+peering, acting [153,162,5]
34 ops are blocked > 268435 sec on osd.153
13 ops are blocked > 134218 sec on osd.153
1 osds have slow requests
mds0: Behind on trimming (76/30)(max_segments: 30, num_segments: 76)
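In case it is useful, this is what I was planning to run next on the host carrying osd.153 to see what the blocked requests are actually waiting on (the hostname below is just a placeholder, and I'm assuming the admin socket is in its default location):

[brady@osdhost ~]$ sudo ceph daemon osd.153 dump_ops_in_flight
[brady@osdhost ~]$ sudo ceph daemon osd.153 dump_historic_ops

I can post that output if it would help.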
[brady@mon0 ~]$ ceph pg dump_stuck
ok
pg_stat state up up_primary acting acting_primary
1.efa remapped+peering [153,10,162] 153 [153,162,5] 153
[brady@mon0 ~]$ ceph pg 1.efa query
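The up set from dump_stuck ([153,10,162]) doesn't match the acting set ([153,162,5]), which I assume is the "remapped" part. For question 2, the only ideas I have so far are to re-trigger peering by bouncing the primary, roughly:

[brady@mon0 ~]$ ceph osd down 153

or, on the host carrying osd.153 (placeholder hostname):

[brady@osdhost ~]$ sudo systemctl restart ceph-osd@153

Since this is production, I'd like a sanity check that this is safe (or a pointer to a better approach) before I touch anything.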