http://tracker.ceph.com/issues/25131 may relieve the issue. Please try Ceph version 13.2.5.

Regards,
Yan, Zheng

On Thu, Mar 28, 2019 at 6:02 PM Zoë O'Connell <zoe+ceph@xxxxxxxxxx> wrote:
>
> We're running a Ceph Mimic (13.2.4) cluster which is predominantly used
> for CephFS. We have recently switched to using multiple active MDSes to
> cope with load on the cluster, but are experiencing problems with large
> numbers of blocked requests when research staff run large experiments.
> The error associated with the block is:
>
> 2019-03-28 09:31:34.246326 [WRN] 6 slow requests, 0 included below;
> oldest blocked for > 423.987868 secs
> 2019-03-28 09:31:29.246202 [WRN] slow request 62.572806 seconds old,
> received at 2019-03-28 09:30:26.673298:
> client_request(client.5882168:1404749 lookup #0x10000000441/run_output
> 2019-03-28 09:30:26.653089 caller_uid=0, caller_gid=0{}) currently
> failed to authpin, subtree is being exported
>
> Eventually, many hundreds of requests are blocked for hours.
>
> It appears (as alluded to by the "subtree is being exported" error) that
> this is related to the MDSes remapping entries between ranks under load,
> as it is always accompanied by messages along the lines of
> "mds.0.migrator nicely exporting to mds.1". Migrations that occur when
> the cluster is not under heavy load complete OK, but under load it seems
> the operation never completes, or deadlocks, for some reason.
>
> We can clear the immediate problem by restarting the affected MDS, and
> have a partial workaround in applying subtree pinning to everything, but
> this is far from ideal. Does anyone have any pointers on where else we
> should be looking to troubleshoot this?
>
> Thanks,
>
> Zoe.
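For reference, the subtree pinning workaround mentioned above is done with the `ceph.dir.pin` extended attribute on a mounted CephFS directory, which ties that subtree to a specific MDS rank so the balancer stops migrating it. A minimal sketch, assuming a CephFS mount at `/mnt/cephfs` and a hypothetical `experiments` directory (both paths are illustrative, not from the thread):

```shell
# Pin the subtree to MDS rank 1; the balancer will no longer export it
# between ranks, avoiding "failed to authpin, subtree is being exported".
setfattr -n ceph.dir.pin -v 1 /mnt/cephfs/experiments

# Inspect the current pin on the directory.
getfattr -n ceph.dir.pin /mnt/cephfs/experiments

# A value of -1 removes the pin and hands the subtree back to the
# default balancer.
setfattr -n ceph.dir.pin -v -1 /mnt/cephfs/experiments
```

Pins are inherited by child directories unless a child sets its own pin, which is why pinning only a handful of top-level trees is usually enough.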
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com