On 06/09/2017 12:07 AM, NeilBrown wrote:
But "soft" is generally a bad idea. It can lead to data corruption in various way as it ports errors to user-space which user-space is often not expecting.
From reading "man 5 nfs" I understood the one situation in which this option makes a difference is when the NFS server becomes unavailable/unreachable. With "hard", user-space applications will wait indefinitely in the hope that the NFS service becomes available again. I see that if there was only some temporary glitch in connectivity to the NFS server, this waiting might yield a better outcome - but that should be covered by the timeout grace periods anyway (see the mount-option sketch after this list). But if:

- the unreachability of the service persists for a very long time, it is bad that any monitoring of the applications on the server will take just as long to notice that this is no longer a tolerable situation and that some sort of fail-over to different application instances needs to be triggered;

- the unavailability/unreachability of the service is resolved by rebooting the NFS server, chances are that the files are then in a different state than before (due to reverting to the last known consistent state of the local filesystem on the server), and in that situation I don't want to fool the client into thinking that everything is fine I/O-wise - better to signal an error so the application becomes aware of the situation;

- the unavailability/unreachability of the service is unresolvable because the primary NFS server died completely, then the files will clearly be in a different state once a secondary service is brought up - and a "kill -9" on all the processes waiting for NFS I/O seems at least as likely to cause the applications trouble as returning an error on the pending I/O operations would.
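(For illustration, the two behaviors are selected at mount time roughly like this; server name and paths are made up, and timeo counts in tenths of a second:)

  # "soft": give up after the retransmission limit (retrans) is exceeded
  # and return an error (e.g. EIO) to the application.
  mount -t nfs -o soft,timeo=100,retrans=3 nfsserver:/export /mnt/data

  # "hard": retry indefinitely; applications block until the server responds.
  mount -t nfs -o hard nfsserver:/export /mnt/data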
> These days, the processes in D state are (usually) killable.
If that is true for processes waiting on (hard-)mounted NFS services, that is really appreciated and good to know. It would certainly help us the next time we try a newer NFS protocol release :-) (BTW: I recently had to reboot a machine because processes that waited for access to a long-removed USB device persisted in D-state... and were immune to "kill -9". So at least the USB driver subsystem still seems to contain such pitfalls.)
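(In case it helps others debugging similar hangs - an untested sketch for spotting such processes and seeing what they are blocked in:)

  # List processes in uninterruptible sleep ("D") together with the
  # kernel function they are waiting in:
  ps -eo pid,stat,wchan:30,comm | awk '$2 ~ /^D/'

  # If the wait is a killable sleep (as newer NFS client code uses),
  # SIGKILL actually terminates the process:
  kill -9 <PID>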
> Thanks. Probably the key line is
>
>   [2339904.695240] RPC: 46702 remote rpcbind: RPC program/version unavailable
>
> The client is trying to talk to lockd on the server, and lockd doesn't seem to be there.
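(For the record, the server-side registrations can be checked directly; "myserver" stands in for our real hostname:)

  # List all RPC services registered with the server's rpcbind;
  # a healthy NFS server should include "nlockmgr" (lockd):
  rpcinfo -p myserver

  # Probe the lock manager explicitly (NLM version 4 over TCP):
  rpcinfo -t myserver nlockmgr 4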
"ps" however says there is a process of that name running on that server:
USER      PID %CPU %MEM  VSZ  RSS TTY STAT START TIME COMMAND
root     3753  0.0  0.0    0    0 ?   S    May26 0:02  \_ [lockd]
Your assumption:
> My guess is that rpcbind was restarted with the "-w" flag, so it lost all the state that it previously had.
seems to be right:
> systemctl status rpcbind
● rpcbind.service - RPC bind service
   Loaded: loaded (/usr/lib/systemd/system/rpcbind.service; enabled; vendor preset: enabled)
   Active: active (running) since Wed 2017-05-31 10:06:05 CEST; 1 weeks 2 days ago
  Process: 14043 ExecStart=/sbin/rpcbind -w $RPCBIND_ARGS (code=exited, status=0/SUCCESS)
 Main PID: 14044 (rpcbind)
   CGroup: /system.slice/rpcbind.service
           └─14044 /sbin/rpcbind -w

May 31 10:06:05 myserver systemd[1]: Starting RPC bind service...
May 31 10:06:05 myserver systemd[1]: Started RPC bind service.
If that kind of invocation is known to cause trouble, I wonder why RedHat/CentOS chose to make it what seems to be their default...
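(If dropping the flag turns out to be advisable, I assume the cleanest way on a systemd-based distribution would be a drop-in override rather than editing the vendor unit file; untested sketch:)

  # Open an override file for rpcbind.service:
  systemctl edit rpcbind

  # Drop-in contents: clear the vendor ExecStart, then restate it
  # without the -w flag:
  [Service]
  ExecStart=
  ExecStart=/sbin/rpcbind $RPCBIND_ARGS

  # Apply the change:
  systemctl daemon-reload
  systemctl restart rpcbind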
> If you stop and restart the NFS service on the server, it might start working again. Otherwise just reboot the NFS server.
A "systemctl stop nfs ; systemctl start nfs" was not sufficent, only changed the symptom:
sqlite3 x.sqlite "PRAGMA case_sensitive_like=1;PRAGMA synchronous=OFF;PRAGMA recursive_triggers=ON;PRAGMA foreign_keys=OFF;PRAGMA locking_mode = NORMAL;PRAGMA journal_mode = TRUNCATE;"
Error: database is locked
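(The locking failure can also be reproduced without sqlite: flock() on an NFS mount is emulated as a byte-range lock on Linux, so it goes through lockd/NLM as well. The path below is made up:)

  # Prints the message if NLM locking works; hangs or fails otherwise:
  flock /mnt/nfs/locktest -c 'echo lock acquired'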
On the server, at the same time, the following message is emitted to the system log:
Jun 9 12:53:57 myserver kernel: lockd: cannot monitor myclient
What did help, however, was running the following on the server (which fits the "cannot monitor" message: lockd apparently could not reach rpc.statd, which implements the lock-status monitoring):
systemctl stop rpc-statd ; systemctl start rpc-statd
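(To check that the relevant registrations are back after restarting statd - "myserver" again stands in for the real hostname:)

  # Both statd ("status") and lockd ("nlockmgr") should be listed again:
  rpcinfo -p myserver | egrep 'status|nlockmgr'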
So thanks for your analysis! We now know a way to remove the symptom with relatively little disturbance of services. Should we somehow try to get rid of that "-w" to rpcbind, in an attempt to not re-trigger the symptom at a later time?

Regards,

Lutz Vieweg