Hi, I forgot to add the following information, which was discussed on
the NFS mailing list with Chuck Lever and led us to believe there is a
software bug in the kernel, not necessarily a server overload.
On the NFS server, we also mount some other NFS shares from other NFS
servers, over 1GbE:
150.x.x.116:/wing on /wing type nfs (rw,addr=150.x.x.116)
10.10.10.201:/opt/ftproot on /opt/ftproot type nfs (rw,vers=4,addr=10.10.10.201,clientaddr=10.10.10.100)
150.x.x.202:/archive on /archive type nfs (rw,vers=4,addr=150.x.x.202,clientaddr=128.x.x.2)
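In case it helps anyone compare setups, here is a rough Python sketch that
pulls the server address and NFS version out of `mount` output like the
lines above. The function names (parse_mount_line, nfs_summary) are my own
and purely illustrative, and it assumes each mount entry sits on a single
line in the "<source> on <mountpoint> type <fstype> (<options>)" form.

```python
import re

# Assumed line shape, as printed by `mount`:
#   <source> on <mountpoint> type <fstype> (<opt>,<opt>=<val>,...)
MOUNT_RE = re.compile(r"^(\S+) on (\S+) type (\S+) \(([^)]*)\)$")


def parse_mount_line(line):
    """Parse one line of `mount` output; return None if it does not match."""
    m = MOUNT_RE.match(line.strip())
    if m is None:
        return None
    source, mountpoint, fstype, raw_opts = m.groups()
    options = {}
    for opt in raw_opts.split(","):
        if not opt:
            continue
        key, sep, value = opt.partition("=")
        # Flags without "=" (such as "rw") are stored as True.
        options[key] = value if sep else True
    return {
        "source": source,
        "mountpoint": mountpoint,
        "fstype": fstype,
        "options": options,
    }


def nfs_summary(lines):
    """Return (mountpoint, server address, NFS version or None) per NFS mount."""
    summary = []
    for line in lines:
        entry = parse_mount_line(line)
        if entry and entry["fstype"].startswith("nfs"):
            opts = entry["options"]
            summary.append((entry["mountpoint"], opts.get("addr"), opts.get("vers")))
    return summary


if __name__ == "__main__":
    # The three mounts quoted in this post, one entry per line.
    sample = [
        "150.x.x.116:/wing on /wing type nfs (rw,addr=150.x.x.116)",
        "10.10.10.201:/opt/ftproot on /opt/ftproot type nfs "
        "(rw,vers=4,addr=10.10.10.201,clientaddr=10.10.10.100)",
        "150.x.x.202:/archive on /archive type nfs "
        "(rw,vers=4,addr=150.x.x.202,clientaddr=128.x.x.2)",
    ]
    for row in nfs_summary(sample):
        print(row)
```

A version of None only means no "vers=" option was printed for that mount
(as with /wing above); it says nothing about which version was negotiated.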
This hang/bug seems to occur when we read from or write to these other
shares on the NFS server while the server is also busy processing work
from the cluster over the RDMA exports. There used to be two other NFS
mounts, which received our backups every night at 8PM. I noticed that
the RDMA errors from my original post all showed up shortly after 8PM,
so we removed those NFS mounts and converted the backup to transfer via
SSH instead. The RDMA errors no longer appear after 8PM when the backup
runs, but they still show up whenever we read from or write to the
remaining NFS mounts listed above, which we still need.
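For anyone wanting to check for the same kind of time-of-day pattern, a
minimal Python sketch of how the errors could be bucketed by hour follows.
It assumes log lines start with the traditional syslog timestamp
("Mon DD HH:MM:SS"); the function name errors_per_hour and the search
string passed to it are hypothetical, so substitute whatever text your
actual error messages contain.

```python
import re
from collections import Counter

# Assumed traditional syslog prefix: "Mon DD HH:MM:SS " at the start of a line.
SYSLOG_TS = re.compile(r"^[A-Z][a-z]{2}\s+\d{1,2}\s+(\d{2}):\d{2}:\d{2}\s")


def errors_per_hour(lines, needle):
    """Count lines containing `needle` (case-insensitive), keyed by hour 0-23."""
    counts = Counter()
    wanted = needle.lower()
    for line in lines:
        if wanted not in line.lower():
            continue
        m = SYSLOG_TS.match(line)
        if m:
            counts[int(m.group(1))] += 1
    return counts


if __name__ == "__main__":
    import sys

    # Usage (hypothetical): python errors_per_hour.py <logfile> <search text>
    path, needle = sys.argv[1], sys.argv[2]
    with open(path, errors="replace") as fh:
        for hour, n in sorted(errors_per_hour(fh, needle).items()):
            print(f"{hour:02d}:00  {n}")
```

A spike in one hour that lines up with a scheduled job, as with our 8PM
backup, is a hint rather than proof, but it narrows down what to test next.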
It seems we should be able to use these different mounts and exports
together without issue, which is why we suspect a software bug somewhere.
Are there any other suggested solutions to this problem? Perhaps some
system, network, and/or filesystem tuning? Any comments on adding the
"inode64,nobarrier" XFS mount options? Is there any extra information,
such as debug output, that I can gather to help with a bug report?
Thanks
_______________________________________________
CentOS mailing list
CentOS@xxxxxxxxxx
https://lists.centos.org/mailman/listinfo/centos