At the company I work for, we just deployed a small cluster of Linux
workstations using FreeIPA, Kerberos, and NFS-shared home
directories. The NFS server runs on CentOS 7, while the clients run
Rocky Linux 8. As far as I can see from the mount command, v4.1 of the
NFS protocol is being used.
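
For anyone wanting to confirm the negotiated protocol version on their own clients, here is a minimal sketch that extracts the "vers=" option of each NFS mount from text in /proc/mounts format. The function name and the sample line in the comment are hypothetical and are not taken from our machines.

```python
def nfs_mount_versions(mounts_text):
    """Map each NFS mount point to its 'vers=' option (None if absent)."""
    versions = {}
    for line in mounts_text.splitlines():
        fields = line.split()
        # /proc/mounts fields: device, mount point, fs type, options, dump, pass
        if len(fields) < 4 or fields[2] not in ("nfs", "nfs4"):
            continue
        options = {}
        for opt in fields[3].split(","):
            key, _, value = opt.partition("=")
            options[key] = value if value else None
        versions[fields[1]] = options.get("vers")
    return versions


# Hypothetical usage on a client:
#   with open("/proc/mounts") as f:
#       print(nfs_mount_versions(f.read()))
# A line such as "srv:/home /home nfs4 rw,vers=4.1 0 0" maps "/home" to "4.1".
```
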
During the first days of working with this new setup, we started
observing errors on most client machines while working in an
NFS-mounted directory. Specifically, while compiling with gcc/ld, the
ld linker often failed with:
“: final close failed: Input/output error”
Once the error had appeared, it could be triggered consistently;
however, the conditions needed to reproduce it across different
clients are not clear to us. The expected output of the compilation is
a ~40 MB .so file.
Further investigation, including a run under strace, revealed that a
close() call was failing with “-1/EIO”, thus causing the whole
compilation to fail.
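
To illustrate where such an error surfaces, here is a minimal sketch that writes a file and reports the errno returned by close(). It is not the I/O pattern ld actually uses and is not guaranteed to reproduce the problem; the function name and the path in the usage comment are hypothetical. On NFS, the client flushes dirty data when a file is closed, so a failed write can be reported only at close().

```python
import os


def write_and_close(path, size, chunk=1 << 20):
    """Write `size` zero bytes to `path`; return close()'s errno, or 0."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        block = b"\0" * chunk
        remaining = size
        while remaining > 0:
            # os.write may write fewer bytes than requested
            remaining -= os.write(fd, block[:remaining])
    except OSError:
        os.close(fd)
        raise
    try:
        os.close(fd)
    except OSError as exc:
        # An EIO here corresponds to the failure seen in the strace output
        return exc.errno
    return 0


# Hypothetical usage on an NFS-mounted home directory (about 40 MB):
#   err = write_and_close("/home/user/test.bin", 40 * 1024 * 1024)
#   print(os.strerror(err) if err else "close() succeeded")
```
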
Enabling extra debugging output via rpcdebug on the NFS client and
server provided some useful insights. After looking at these logs,
somebody in the #centos IRC channel pointed to the NFS 4.1 server
delegations feature as a potential culprit (and suggested sending a
message to this ML).
Indeed, after echoing a 0 to /proc/sys/fs/leases-enable on the NFS
server and remounting the NFS volume on the client, the issue appears
to be fixed.
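
A minimal sketch for checking the current state of this setting on the server is shown below. The function name is hypothetical. The assumption behind the workaround is that the kernel NFS server implements delegations on top of file leases, so a value of 0 stops it from handing them out.

```python
def leases_enabled(path="/proc/sys/fs/leases-enable"):
    """Return True unless the file at `path` contains '0'."""
    with open(path) as f:
        return f.read().strip() != "0"


# Hypothetical usage on the NFS server:
#   print("file leases enabled:", leases_enabled())
```

A value written directly to /proc does not survive a reboot, so the check is worth repeating after the server restarts.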
However, on some clients where the NFS volume hasn’t yet been
remounted, the ld operation still fails with the same error, again
without a clear pattern for reproducibility.
I have collected a strace of the failing gcc/ld compilation and a
tcpdump capture of the traffic between the NFS client and server
during the failing compilation, in the hope that they will help
somebody shed some light on the issue.

Versions involved:
NFS server: kernel version 5.4.175-1.el7.elrepo x86_64
NFS clients: kernel version 5.4.178-1.el8.elrepo x86_64,
gcc 8.5.0 20210514, ld 2.30-108.el8_5.1

Kind regards,
Sebastiano Pomata