We're running a Ceph mimic (13.2.4) cluster which is predominantly used
for CephFS. We have recently switched to using multiple active MDSes to
cope with load on the cluster, but are experiencing problems with large
numbers of blocked requests when research staff run large experiments.
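(For context, the extra active rank was enabled in the usual way, roughly:

    ceph fs set <fs_name> max_mds 2

with <fs_name> standing in for our filesystem name.)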
The error associated with the blocked requests is:
2019-03-28 09:31:34.246326 [WRN] 6 slow requests, 0 included below;
oldest blocked for > 423.987868 secs
2019-03-28 09:31:29.246202 [WRN] slow request 62.572806 seconds old,
received at 2019-03-28 09:30:26.673298:
client_request(client.5882168:1404749 lookup #0x10000000441/run_output
2019-03-28 09:30:26.653089 caller_uid=0, caller_gid=0{}) currently
failed to authpin, subtree is being exported
Eventually, many hundreds of requests are blocked for hours.
It appears (as suggested by the "subtree is being exported" error) that
this is related to the MDSes migrating subtrees between ranks under load,
as it is always accompanied by messages along the lines of
"mds.0.migrator nicely exporting to mds.1". Migrations that occur when
the cluster is not under heavy load complete fine, but under load the
export seems to either never complete or to deadlock for some reason.
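If it helps with diagnosis, we can pull the list of stuck requests and
the current subtree map from the admin socket on the affected rank,
along these lines (daemon name is ours, substitute as appropriate):

    # requests currently stuck on this rank
    ceph daemon mds.<daemon_name> dump_ops_in_flight
    # subtree map, i.e. which directory trees each rank currently holds
    ceph daemon mds.<daemon_name> get subtrees

Happy to post that output if it would be useful.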
We can clear the immediate problem by restarting the affected MDS, and
we have a partial workaround in pinning everything explicitly (pinning
command included below for reference), but this is far from ideal. Does
anyone have any pointers as to where else we should be looking to
troubleshoot this?
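For reference, the pinning is just the standard ceph.dir.pin xattr set
on directories from a client mount, e.g. (path is illustrative):

    # pin this tree to rank 1; -v -1 would remove the pin again
    setfattr -n ceph.dir.pin -v 1 /mnt/cephfs/some_project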
Thanks,
Zoe.