On Thu, Jun 08 2017, Lutz Vieweg wrote:

> On 06/07/2017 05:08 AM, NeilBrown wrote:
>>>> fcntl(3, F_SETLK, {type=F_RDLCK, whence=SEEK_SET, start=1073741824, len=1}) = -1 EIO (Input/output error)
>>>> write(2, "Error: disk I/O error\n", 22Error: disk I/O error
>>>
>>> But unlike the original reporter, we use the NFS v3 protocol:
>>>> myserver:/data on /data type nfs (rw,relatime,vers=3,rsize=1048576,wsize=1048576,namlen=255,soft,proto=tcp,timeo=600,retrans=2,sec=sys,mountvers=3,mountport=20048,mountproto=udp,local_lock=none)
>>
>> Using "soft" is not a good idea.  It could be the cause, but it isn't very
>> likely if NFS is otherwise working OK.
>
> NFS v3 has been working very well for us for many years.
> When we upgraded those two servers ~3 years ago, we did try NFS v4 first, but
> that had caused frequent occurrences of "un-killable processes in D state",
> so we had to revert to v3 to allow for stable operation.

I queried the use of "soft" - as opposed to "hard".
You defend the use of v3 as opposed to v4.
I think there is some miscommunication happening here.

If v3 works better for you than v4, then certainly use it.  You could try
reporting details of the problems with v4, but I cannot promise a
helpful response, so it is totally up to you.

But "soft" is generally a bad idea.  It can lead to data corruption in
various ways, as it reports errors to user-space which user-space is often
not expecting.

These days, the processes in D state are (usually) killable.

>
>> It might help to run
>>    rpcdebug -m nfs -s all; rpcdebug -m nlm -s all; rpcdebug -m rpc -s all
>>    #repeat your test
>>    rpcdebug -m nfs -c all; rpcdebug -m nlm -c all; rpcdebug -m rpc -c all
>>
>> then collect the kernel logs (possibly just run "dmesg") and post all
>> the messages which happened at that time.
>
> Ok, attaching a log generated like this while running:
>
> sqlite3 x.sqlite "PRAGMA case_sensitive_like=1;PRAGMA synchronous=OFF;PRAGMA
> recursive_triggers=ON;PRAGMA foreign_keys=OFF;PRAGMA locking_mode = NORMAL;PRAGMA journal_mode =
> TRUNCATE;"

Thanks.  Probably the key line is

  [2339904.695240] RPC: 46702 remote rpcbind: RPC program/version unavailable

The client is trying to talk to lockd on the server, and lockd doesn't
seem to be there.

>
>> It might also help to find the port number that lockd is running on
>>    rpcinfo -p $SERVER | grep 'tcp.*nlockmgr'
>
> None of the ports reported this way contains the string "nlockmgr":

This agrees with the line from the log.
If nlockmgr isn't listed, then locking cannot work.  This is the cause
of your problem.

>> rpcinfo -p myserver
>>    program vers proto   port  service
>>     100000    4   tcp    111  portmapper
>>     100000    3   tcp    111  portmapper
>>     100000    2   tcp    111  portmapper
>>     100000    4   udp    111  portmapper
>>     100000    3   udp    111  portmapper
>>     100000    2   udp    111  portmapper

Even "nfs" isn't listed - but clearly the nfs server is running.
My guess is that rpcbind was restarted without the "-w" flag, so it lost
all the state that it previously had.

If you stop and restart NFS service on the server, it might start
working again.  Otherwise just reboot the nfs server.

NeilBrown

>
> Regards,
>
> Lutz Vieweg
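
The rpcinfo check discussed above could be scripted roughly as follows.
This is only a sketch: it assumes a POSIX shell with awk, the function
name missing_rpc_services is hypothetical, and the set of services it
looks for (nfs, mountd and nlockmgr over TCP) is an assumption about what
an NFS v3 server with locking would normally register with rpcbind.

```shell
#!/bin/sh
# Hypothetical helper: reads "rpcinfo -p" output on stdin and prints the
# name of each expected service that has no TCP registration.
# The expected list (nfs, mountd, nlockmgr) is an assumption for NFS v3.
missing_rpc_services() {
    awk '
        # Columns are: program vers proto port service.
        # The header line has "proto" in column 3, so it never matches.
        $3 == "tcp" { seen[$5] = 1 }
        END {
            n = split("nfs mountd nlockmgr", want, " ")
            for (i = 1; i <= n; i++)
                if (!(want[i] in seen))
                    print want[i]
        }
    '
}

# Example use (SERVER is whatever host exports the filesystem):
#   rpcinfo -p "$SERVER" | missing_rpc_services
```

Empty output means all three services are registered.  Fed the listing
quoted above, which contains only portmapper entries, it would print all
three names.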