Re: RDMA connection closed and not re-opened

Chuck Lever wrote on 07/14/2018 07:37 AM:
> I wasn't entirely clear: Does pac mount itself?

No, why would we do that? Do people do that? Here is a listing of the relevant mounts on our server pac:

/dev/sdc1 on /data type xfs (rw)
/dev/sdb1 on /projects type xfs (rw)
/dev/sde1 on /working type xfs (rw,nobarrier)
nfsd on /proc/fs/nfsd type nfsd (rw)
/dev/drbd0 on /newwing type xfs (rw)
150.x.x.116:/wing on /wing type nfs (rw,addr=150.x.x.116)
150.x.x.116:/archive on /archive type nfs (rw,addr=150.x.x.116)
150.x.x.116:/backups on /backups type nfs (rw,addr=150.x.x.116)

The backup jobs read from the locally mounted disks /data and /projects and write to the remote NFS server at /backups and /archive. In the log files of our other servers, which mount the pac exports, I have noticed "nfs: server pac not responding, timed out" messages, and they all show up after 8 PM while the backup jobs are running.
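One thing I plan to check (the commands below are only a sketch; the log path and thread count will vary with the distro and configuration) is whether all of the nfsd threads on pac are tied up during the backup window and whether the client timeouts line up with it:

# on a client that logged the errors: pull the timeout messages with timestamps
grep "server pac not responding" /var/log/messages

# on pac: how many nfsd threads are configured
cat /proc/fs/nfsd/threads

# on pac: per-pool stats (packets arrived, threads woken, threads timed out)
cat /proc/fs/nfsd/pool_stats

# on pac: whether the nfsd kernel threads are sitting in uninterruptible sleep (state D)
ps -eLo pid,stat,wchan:32,comm | grep '[n]fsd'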

And here is a listing of our pac server's exports:

/data	10.10.10.0/24(rw,no_root_squash,async)
/data	10.10.11.0/24(rw,no_root_squash,async)
/data	150.x.x.192/27(rw,no_root_squash,async)
/data	150.x.x.64/26(rw,no_root_squash,async)
/home	10.10.10.0/24(rw,no_root_squash,async)
/home	10.10.11.0/24(rw,no_root_squash,async)
/opt	10.10.10.0/24(rw,no_root_squash,async)
/opt	10.10.11.0/24(rw,no_root_squash,async)
/projects	10.10.10.0/24(rw,no_root_squash,async)
/projects	10.10.11.0/24(rw,no_root_squash,async)
/projects	150.x.x.192/27(rw,no_root_squash,async)
/projects	150.x.x.64/26(rw,no_root_squash,async)
/tools	10.10.10.0/24(rw,no_root_squash,async)
/tools	10.10.11.0/24(rw,no_root_squash,async)
/usr/share/gridengine     10.10.10.10/24(rw,no_root_squash,async)
/usr/share/gridengine     10.10.11.10/24(rw,no_root_squash,async)
/usr/local	10.10.10.10/24(rw,no_root_squash,async)
/usr/local	10.10.11.10/24(rw,no_root_squash,async)
/working	10.10.10.0/24(rw,no_root_squash,async)
/working	10.10.11.0/24(rw,no_root_squash,async)
/working	150.x.x.192/27(rw,no_root_squash,async)
/working	150.x.x.64/26(rw,no_root_squash,async)
/newwing	10.10.10.0/24(rw,no_root_squash,async)
/newwing	10.10.11.0/24(rw,no_root_squash,async)
/newwing	150.x.x.192/27(rw,no_root_squash,async)
/newwing	150.x.x.64/26(rw,no_root_squash,async)

The 10.10.10.0/24 network is 1GbE and 10.10.11.0/24 is the InfiniBand network; the other networks are also 1GbE. Our cluster nodes normally mount all of these over InfiniBand with RDMA. The computation jobs mostly use /working, which sees the most reading and writing, but /newwing, /projects, and /data are also used.
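For reference, the cluster nodes mount these over the IB network with the usual RDMA options, roughly like the line below (the address and options are only an illustration of the syntax, not copied from our actual fstab; 20049 is the standard NFS/RDMA port):

# illustrative fstab entry on a compute node (placeholder server address)
10.10.11.1:/working  /working  nfs  proto=rdma,port=20049  0 0

# equivalent one-off mount from the command line
mount -t nfs -o proto=rdma,port=20049 10.10.11.1:/working /working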

It still looks like a bug in NFS to me, and it somehow seems to be triggered when the NFS server runs the backup job. I just tried it again: about 20 minutes into the backup job the server stopped responding to some things, e.g. iotop froze. top remained active, and I could see the load on the server going up, but only to about 22/24 with the CPU still about 95% idle. I also noticed the "nfs: server pac not responding, timed out" messages on our other servers. After about 10 minutes the server became responsive again and the load dropped back down to 3/24 while the backup job continued.
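Next time it wedges I will try to capture some state on pac while it is happening, roughly along these lines (the sysrq dump needs root and sysrq enabled, and lands in the kernel log):

# load average and per-device I/O while the hang is in progress
cat /proc/loadavg
iostat -x 1 5

# NFS server op counts, taken a couple of times to see if nfsd is still making progress
nfsstat -s

# dump stack traces of blocked (D-state) tasks into the kernel log, then read them back
echo w > /proc/sysrq-trigger
dmesg | tail -n 100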

Perhaps it could be mitigated by changing the backup job to use SSH instead of NFS. I'll try that and see if it helps; then, once our job has completed, I can try going back to RDMA to see whether it still happens.
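Concretely, the idea is to swap the NFS writes in the backup script for rsync over SSH, something like the following (the host name and target paths are placeholders, not our real script):

# push the local filesystems to the backup host over SSH instead of the NFS mounts
rsync -aH -e ssh --delete /data/     backuphost:/backups/data/
rsync -aH -e ssh --delete /projects/ backuphost:/backups/projects/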




