Re: PROBLEM: nfs I/O errors with sqlite applications

NeilBrown <neilb@xxxxxxxx> · Fri, 09 Jun 2017 08:07:23 +1000

On Thu, Jun 08 2017, Lutz Vieweg wrote:

> On 06/07/2017 05:08 AM, NeilBrown wrote:
>>>>   fcntl(3, F_SETLK, {type=F_RDLCK, whence=SEEK_SET, start=1073741824, len=1}) = -1 EIO (Input/output error)
>>>>   write(2, "Error: disk I/O error\n", 22Error: disk I/O error
>>>
>>> But unlike the original reporter, we use the NFS v3 protocol:
>>>> myserver:/data on /data type nfs (rw,relatime,vers=3,rsize=1048576,wsize=1048576,namlen=255,soft,proto=tcp,timeo=600,retrans=2,sec=sys,mountvers=3,mountport=20048,mountproto=udp,local_lock=none)
>>
>> Using "soft" is not a good idea.  It could be the cause, but it isn't very
>> likely if NFS is otherwise working OK.
>
> NFS v3 has been working very well for us for many years.
> When we upgraded those two servers ~3 years ago, we did try NFS v4 first, but
> that had caused frequent occurrences of "un-killable processes in D state",
> so we had to revert to v3 to allow for stable operation.

I queried the use of "soft" - as opposed to "hard".
You defend the use of v3 as opposed to v4.
I think there is some miscommunication happening here.

If v3 works better for you than v4, then certainly use it.
You could try reporting details of the problems with v4, but I cannot
promise a helpful response, so it is totally up to you.

But "soft" is generally a bad idea.  It can lead to data corruption in
various ways, as it reports errors to user-space which user-space is
often not expecting.

These days, the processes in D state are (usually) killable.
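
If you want to see which mounts on a client are affected, something
like the following could list them.  This is only a sketch: the helper
name list_soft_nfs_mounts is made up for illustration, and it assumes
the usual /proc/mounts layout (device, mount point, fs type, options).

```shell
# Hypothetical helper: print the mount point of every NFS mount that
# carries the "soft" option. Reads /proc/mounts unless a file is given.
list_soft_nfs_mounts() {
    awk '($3 == "nfs" || $3 == "nfs4") && $4 ~ /(^|,)soft(,|$)/ { print $2 }' \
        "${1:-/proc/mounts}"
}
```

Any mount point it prints would need "soft" replaced by "hard" in the
mount options (fstab or automounter map) and a remount.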

>
>> It might help to run
>>    rpcdebug -m nfs -s all; rpcdebug -m nlm -s all ;rpcdebug -m rpc -s all
>>    #repeat your test
>>    rpcdebug -m nfs -c all; rpcdebug -m nlm -c all ;rpcdebug -m rpc -c all
>>
>> then collect the kernel logs (possibly just run "dmesg") and post all
>> the messages which happened at that time.
>
> Ok, attaching a log generated like this while running:
>
> sqlite3 x.sqlite "PRAGMA case_sensitive_like=1;PRAGMA synchronous=OFF;PRAGMA 
> recursive_triggers=ON;PRAGMA foreign_keys=OFF;PRAGMA locking_mode = NORMAL;PRAGMA journal_mode = 
> TRUNCATE;"

Thanks. Probably the key line is

[2339904.695240] RPC: 46702 remote rpcbind: RPC program/version unavailable

The client is trying to talk to lockd on the server, and lockd doesn't
seem to be there.

>
>> It might also help to find the port number that lockd is running on
>>     rpcinfo -p $SERVER | grep 'tcp.*nlockmgr'
>
> None of the ports reported this way contains the string "nlockmgr":

This agrees with the line from the log.  If nlockmgr isn't listed, then
locking cannot work.  This is the cause of your problem.

>> rpcinfo -p myserver
>>    program vers proto   port  service
>>     100000    4   tcp    111  portmapper
>>     100000    3   tcp    111  portmapper
>>     100000    2   tcp    111  portmapper
>>     100000    4   udp    111  portmapper
>>     100000    3   udp    111  portmapper
>>     100000    2   udp    111  portmapper

Even "nfs" isn't listed - but clearly the nfs server is running.
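
A quick way to check this in future might be a small filter over the
rpcinfo output.  This is a sketch, not a definitive tool: the function
name missing_rpc_services is invented here, and the list of services
it expects (portmapper, mountd, nfs, nlockmgr) is my assumption of
what a working NFSv3 server with locking should register.

```shell
# Hypothetical helper: read "rpcinfo -p" output on stdin and print each
# expected service name that is NOT registered with rpcbind.
missing_rpc_services() {
    awk '
        NR > 1 { seen[$5] = 1 }   # line 1 is the header; column 5 is the service name
        END {
            # Assumed set of services for an NFSv3 server with locking.
            n = split("portmapper mountd nfs nlockmgr", want, " ")
            for (i = 1; i <= n; i++)
                if (!(want[i] in seen)) print want[i]
        }
    '
}

# Usage: rpcinfo -p myserver | missing_rpc_services
```

No output means everything in the assumed list is registered; for the
listing you posted it would report mountd, nfs and nlockmgr.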

My guess is that rpcbind was restarted without the "-w" flag, so it lost
all the state that it previously had.
If you stop and restart NFS service on the server, it might start
working again.  Otherwise just reboot the nfs server.

NeilBrown

>
> Regards,
>
> Lutz Vieweg