Re: PROBLEM: nfs I/O errors with sqlite applications

NeilBrown <neilb@xxxxxxxx> · Wed, 07 Jun 2017 13:08:08 +1000

On Tue, Jun 06 2017, Lutz Vieweg wrote:

> On 07/29/2016 07:52 PM, Jeff Layton wrote:
>>>>>>>>>    fcntl(7, F_SETLK, {type=F_RDLCK, whence=SEEK_SET,
>>>>>>>>> start=1073741824, len=1}) = -1 EIO (Input/output error)
>>>
>>> Unfortunately I did not manage to perform a network capture last time
>>> due to power loss.  I did not hit this issue again until yesterday (~9
>>> months later), this time after 45 days of uptime.
>>>
>>> Kernel versions now are: 4.5.1 on the server, and 4.4.3 on the client.
>
> I wanted to add that I, too, have one NFS client and server
> (running linux-4.11.0 on both the server and the client)
> currently in the same kind of state:
>
> I can reproduce in 100% of the cases that the following commands:
>
>> rm -f x.sqlite
>> sqlite3 x.sqlite "PRAGMA case_sensitive_like=1;PRAGMA synchronous=OFF;PRAGMA recursive_triggers=ON;PRAGMA foreign_keys=OFF;PRAGMA locking_mode = NORMAL;PRAGMA journal_mode =  TRUNCATE;"
>
> result in:
>
>>  "Error: disk I/O error"
>
> on the client - while working fine on the NFS server - with the same kind
> of strace output:
>
>>  fcntl(3, F_SETLK, {type=F_RDLCK, whence=SEEK_SET, start=1073741824, len=1}) = -1 EIO (Input/output error)
>>  write(2, "Error: disk I/O error\n", 22Error: disk I/O error
>
> But unlike the original reporter, we use the NFS v3 protocol:
>> server:/data on /data type nfs (rw,relatime,vers=3,rsize=1048576,wsize=1048576,namlen=255,soft,proto=tcp,timeo=600,retrans=2,sec=sys,mountvers=3,mountport=20048,mountproto=udp,local_lock=none)
>
> If you want me to try or trace something on the client,
> I'm willing to help.

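Using "soft" is not a good idea: once a request has been retransmitted
"retrans" times without a reply, the client gives up and returns an error
to the application.  It could be the cause here, but it isn't very likely
if NFS is otherwise working OK.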

It might help to run
  rpcdebug -m nfs -s all; rpcdebug -m nlm -s all; rpcdebug -m rpc -s all
  # repeat your test
  rpcdebug -m nfs -c all; rpcdebug -m nlm -c all; rpcdebug -m rpc -c all

then collect the kernel logs (possibly just run "dmesg") and post all
the messages that were logged while the test ran.
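
If typing the three rpcdebug calls twice gets tedious, something like the
following might do as a rough sketch.  The names rpcdebug_all and RPCDEBUG
are made up for this example (they are not part of nfs-utils); the only
real command used is rpcdebug with the same -m/-s/-c arguments as above.

```shell
# Hypothetical helper: pass -s to set, or -c to clear, all debug flags
# for the nfs, nlm and rpc modules.
# RPCDEBUG is an assumed override so the loop can be dry-run without root.
RPCDEBUG=${RPCDEBUG:-rpcdebug}

rpcdebug_all() {
  for module in nfs nlm rpc; do
    "$RPCDEBUG" -m "$module" "$1" all || return 1
  done
}

# Usage, as root (the log path is only an example):
#   rpcdebug_all -s
#   ... repeat your test ...
#   rpcdebug_all -c
#   dmesg > /tmp/nfs-debug.log
```

Clearing the flags again right after the test keeps the kernel log from
filling up with unrelated RPC chatter.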

It might also help to find the port number that lockd is listening on:
   rpcinfo -p $SERVER | grep 'tcp.*nlockmgr'

(the port is the 4th column; call it $LOCKD_PORT) and capture the traffic:

  tcpdump -s 0 -w /tmp/trace.pcap port 2049 or port $LOCKD_PORT &
  # run test
  killall tcpdump

gzip /tmp/trace.pcap and put it somewhere it can be fetched from - or
maybe post as an attachment if it isn't too big.
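
The lockd port lookup can also be scripted rather than read off by eye.
This is only a sketch: lockd_port is a name invented here, and it assumes
the usual five-column "rpcinfo -p" layout (program, version, protocol,
port, service).

```shell
# Read `rpcinfo -p` output on stdin and print the TCP port of nlockmgr.
# nlockmgr is normally listed once per protocol version with the same
# port, so only the first matching line is printed.
lockd_port() {
  awk '$3 == "tcp" && $5 == "nlockmgr" && !found { print $4; found = 1 }'
}

# Usage (tcpdump needs root; SERVER is your NFS server's hostname):
#   LOCKD_PORT=$(rpcinfo -p "$SERVER" | lockd_port)
#   tcpdump -s 0 -w /tmp/trace.pcap port 2049 or port "$LOCKD_PORT" &
```

If the function prints nothing, lockd is not registered for TCP on that
server, which would itself be worth mentioning in the report.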

NeilBrown