Chuck Lever wrote on 07/14/2018 07:37 AM:
> I wasn't entirely clear: Does pac mount itself?
No, why would we do that? Do people do that? Here is a listing of
relevant mounts on our server pac:
/dev/sdc1 on /data type xfs (rw)
/dev/sdb1 on /projects type xfs (rw)
/dev/sde1 on /working type xfs (rw,nobarrier)
nfsd on /proc/fs/nfsd type nfsd (rw)
/dev/drbd0 on /newwing type xfs (rw)
150.x.x.116:/wing on /wing type nfs (rw,addr=150.x.x.116)
150.x.x.116:/archive on /archive type nfs (rw,addr=150.x.x.116)
150.x.x.116:/backups on /backups type nfs (rw,addr=150.x.x.116)
The backup jobs read from the mounted local disks /data and /projects
and write to the remote NFS server at /backups and /archive. In the log
files of our other servers, which mount the pac exports, I have noticed
"nfs: server pac not responding, timed out" messages, and they all show
up after 8 PM, when the backup jobs are running.
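To give a rough idea of the timing, on one of the clients something like
this (assuming syslog goes to /var/log/messages on those machines) shows
the timeouts clustering shortly after the backup window opens:

  # count NFS timeout messages against pac, grouped by hour of day
  grep 'nfs: server pac not responding' /var/log/messages | \
      awk '{split($3, t, ":"); print t[1]}' | sort | uniq -c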
And here is a listing of our pac server exports:
/data 10.10.10.0/24(rw,no_root_squash,async)
/data 10.10.11.0/24(rw,no_root_squash,async)
/data 150.x.x.192/27(rw,no_root_squash,async)
/data 150.x.x.64/26(rw,no_root_squash,async)
/home 10.10.10.0/24(rw,no_root_squash,async)
/home 10.10.11.0/24(rw,no_root_squash,async)
/opt 10.10.10.0/24(rw,no_root_squash,async)
/opt 10.10.11.0/24(rw,no_root_squash,async)
/projects 10.10.10.0/24(rw,no_root_squash,async)
/projects 10.10.11.0/24(rw,no_root_squash,async)
/projects 150.x.x.192/27(rw,no_root_squash,async)
/projects 150.x.x.64/26(rw,no_root_squash,async)
/tools 10.10.10.0/24(rw,no_root_squash,async)
/tools 10.10.11.0/24(rw,no_root_squash,async)
/usr/share/gridengine 10.10.10.10/24(rw,no_root_squash,async)
/usr/share/gridengine 10.10.11.10/24(rw,no_root_squash,async)
/usr/local 10.10.10.10/24(rw,no_root_squash,async)
/usr/local 10.10.11.10/24(rw,no_root_squash,async)
/working 10.10.10.0/24(rw,no_root_squash,async)
/working 10.10.11.0/24(rw,no_root_squash,async)
/working 150.x.x.192/27(rw,no_root_squash,async)
/working 150.x.x.64/26(rw,no_root_squash,async)
/newwing 10.10.10.0/24(rw,no_root_squash,async)
/newwing 10.10.11.0/24(rw,no_root_squash,async)
/newwing 150.x.x.192/27(rw,no_root_squash,async)
/newwing 150.x.x.64/26(rw,no_root_squash,async)
The 10.10.10.0/24 network is 1GbE and 10.10.11.0/24 is the InfiniBand
network; the other networks are also 1GbE. Our cluster nodes normally
mount all of these over the InfiniBand with RDMA. The computation jobs
mostly use /working, which sees the most reading and writing, but
/newwing, /projects, and /data are also used.
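For reference, the compute nodes mount these with NFS/RDMA over the IB
network; an fstab entry looks roughly like this (the 10.10.11.2 address
for pac's IB interface is just a placeholder here):

  # NFS mount over InfiniBand using RDMA on the standard NFS/RDMA port
  10.10.11.2:/working  /working  nfs  proto=rdma,port=20049,hard  0 0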
It does continue to look like a bug in NFS, and it somehow seems to be
triggered when the NFS server runs the backup job. I just tried it now,
and about 20 minutes into the backup job the server stopped responding
to some things; iotop, for example, froze. top remained active and
showed the load on the server going up, but only to about 22/24, with
the CPUs still about 95% idle. I also noticed the "nfs: server pac not
responding, timed out" messages on our other servers. After about 10
minutes the server became responsive again and the load dropped to 3/24
while the backup job continued.
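Next time it hangs I will try to capture what the server is actually
blocked on while it is unresponsive, something along these lines
(standard tools only, and the sysrq dump assumes sysrq is enabled):

  # dump blocked (D state) tasks to the kernel log
  echo w > /proc/sysrq-trigger
  # see which kernel routine the nfsd threads are sleeping in
  ps -eo pid,state,wchan:32,comm | grep nfsd
  # per-pool nfsd thread/queue statistics
  cat /proc/fs/nfsd/pool_stats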
Perhaps it could be mitigated if I change the backup job to use SSH
instead of NFS. I'll try that and see if it helps, and once our current
job has completed I can try going back to RDMA to see if it still
happens.
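Concretely, assuming the jobs can be switched to rsync over SSH, the
idea is to push directly to the backup host instead of writing into the
/backups and /archive NFS mounts (the target paths below are just
illustrative):

  # copy over SSH instead of through the NFS client on pac
  rsync -a /data/     150.x.x.116:/backups/data/
  rsync -a /projects/ 150.x.x.116:/backups/projects/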