Hi All
MDS active/passive
Jewel 10.2.2
Ceph client 3.10.0-514.6.1.el7.x86_64
Cephfs mount: (rw,relatime,name=admin,secret=<hidden>,acl)
I can see some slow requests in the MDS log during the time the NFS processes were hung, some for setattr calls:
2017-06-15 04:29:37.081175 7f889401f700 0 log_channel(cluster) log [WRN] : slow request 60.974528 seconds old, received at 2017-06-15 04:28:36.106598: client_request(client.2622511:116375892 setattr size=0 #100025b3554 2017-06-15 04:28:36.104928) currently acquired locks
and some for getattr:
2017-06-15 04:29:42.081224 7f889401f700 0 log_channel(cluster) log [WRN] : slow request 32.225883 seconds old, received at 2017-06-15 04:29:09.855302: client_request(client.2622511:116380541 getattr pAsLsXsFs #100025b4d37 2017-06-15 04:29:09.853772) currently failed to rdlock, waiting
And a "client not responding to mclientcaps revoke" warning:
2017-06-15 04:31:12.084561 7f889401f700 0 log_channel(cluster) log [WRN] : client.2344872 isn't responding to mclientcaps(revoke), ino 100025b4d37 pending pAsxLsXsxFcb issued pAsxLsXsxFsxcrwb, sent 122.229172 seconds ago
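When chasing a warning like that, it can help to map the client id back to a host and see what the MDS is blocked on. A minimal sketch via the MDS admin socket, assuming shell access on the MDS host; "mds.a" is a placeholder daemon name, not from this thread:

```shell
# List MDS sessions: match client.2344872 to its client_metadata
# (hostname, mount point, kernel version) to find the stuck mount.
ceph daemon mds.a session ls

# Show the requests the MDS currently has in flight, including how long
# each has been waiting and on which lock/cap it is blocked.
ceph daemon mds.a dump_ops_in_flight
```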
These issues seemed to have cleared once the faulty OSD was marked out.
In general I have noticed that the NFS processes exporting CephFS spend a lot of time in 'D' state with WCHAN as 'lock_page', compared with an NFS server exporting a local file system. Also, NFS performance hasn't been great with small reads/writes, particularly writes with the default sync export option; I've had to export with async for the time being. I haven't had a chance to troubleshoot this in any depth yet, just mentioning it in case it's relevant.
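For anyone wanting to reproduce the D-state observation, here is a small sketch that reads /proc directly and lists processes in uninterruptible sleep together with their kernel wait channel, roughly equivalent to `ps -eo pid,stat,wchan:32,comm`. The function names are my own, not from any Ceph or NFS tooling:

```python
import os

def proc_state(pid):
    """Return (state, comm, wchan) for a pid, or None if it vanished."""
    try:
        with open(f"/proc/{pid}/stat") as f:
            data = f.read()
        # comm may contain spaces; it is delimited by the first '(' and
        # the LAST ')'. The single-character state follows that ')'.
        comm = data[data.index("(") + 1:data.rindex(")")]
        state = data[data.rindex(")") + 2]
        with open(f"/proc/{pid}/wchan") as f:
            wchan = f.read().strip() or "-"
        return state, comm, wchan
    except (FileNotFoundError, ProcessLookupError, PermissionError):
        return None

def d_state_procs():
    """All pids currently in 'D' (uninterruptible sleep), with wchan."""
    out = []
    for entry in os.listdir("/proc"):
        if entry.isdigit():
            info = proc_state(int(entry))
            if info and info[0] == "D":
                out.append((int(entry), info[1], info[2]))
    return out

if __name__ == "__main__":
    for pid, comm, wchan in d_state_procs():
        print(f"{pid:>7} {comm:<20} {wchan}")
```

Running this in a loop while an NFS transfer is in progress would show whether the nfsd threads are persistently parked in lock_page (i.e. waiting on CephFS page locks) or only transiently.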
Thanks,
David
_______________________________________________ ceph-users mailing list ceph-users@xxxxxxxxxxxxxx http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com