Re: PROBLEM: nfs I/O errors with sqlite applications

On 06/09/2017 12:07 AM, NeilBrown wrote:
But "soft" is generally a bad idea.  It can lead to data corruption in
various ways, as it reports errors to user-space which user-space is often
not expecting.

From reading "man 5 nfs" I understood that the one situation in which this
option makes a difference is when the NFS server becomes unavailable/unreachable.

With "hard" user-space applications will wait indefinitely in the hope
that the NFS service will become available again.

I see that if there was only a temporary connectivity glitch between the
client and the NFS server, this waiting might yield a better outcome - but
that case should be covered by the timeout grace periods anyway.

But if:

- Unreachability of the service persists for a very long time:
  it is bad that it then also takes a very long time for any monitoring
  of the applications on the server to notice that this is no longer
  a tolerable situation and that some sort of fail-over to different
  application instances needs to be triggered

- The unavailability/unreachability of the service is resolved by rebooting
  the NFS server: chances are that the files are then in a different state
  than before (due to reverting to the last known consistent state of
  the local filesystem on the server), and in that situation I don't
  want to fool the client into thinking that everything is fine I/O-wise -
  better to signal an error to make the application aware of the situation

- The unavailability/unreachability of the service is unresolvable, because
  the primary NFS server died completely: then the files will clearly be
  in a different state once a secondary service is brought up - and a
  "kill -9" on all the processes waiting for NFS I/O seems just as likely
  to me to cause the applications trouble as returning an error on
  the pending I/O operations would.

These days, the processes in D state are (usually) killable.

If that is true for processes waiting on (hard) mounted NFS services,
that is really appreciated and good to know. It would certainly help
us next time we try a newer NFS protocol release :-)

(BTW: I recently had to reboot a machine because processes that
waited for access to a long-removed USB device persisted in D-state
and were immune to "kill -9". So at least the USB driver subsystem
seems to still contain such pitfalls.)

Thanks. Probably the key line is

[2339904.695240] RPC: 46702 remote rpcbind: RPC program/version unavailable

The client is trying to talk to lockd on the server, and lockd doesn't
seem to be there.

"ps" however says there is a process of that name running on that server:
USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root      3753  0.0  0.0      0     0 ?        S    May26   0:02  \_ [lockd]
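"ps" only shows that the kernel thread exists; whether lockd is still registered with rpcbind (which is what the client's RPC error is about) can be checked with rpcinfo - a sketch, with "myserver" standing in for the real host:

```shell
# Ask rpcbind on the server for its registration table and look for the
# NLM ("nlockmgr") and NSM ("status") services that lockd/statd provide.
# The fallback message covers an unreachable or freshly restarted rpcbind.
regs=$(rpcinfo -p myserver 2>/dev/null | egrep 'nlockmgr|status')
echo "${regs:-nlockmgr/status not registered (or rpcbind not reachable)}"
```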

Your assumption:
My guess is that rpcbind was restarted with the "-w" flag, so it lost
all the state that it previously had.
seems to be right:

> systemctl status rpcbind
● rpcbind.service - RPC bind service
   Loaded: loaded (/usr/lib/systemd/system/rpcbind.service; enabled; vendor preset: enabled)
   Active: active (running) since Wed 2017-05-31 10:06:05 CEST; 1 weeks 2 days ago
  Process: 14043 ExecStart=/sbin/rpcbind -w $RPCBIND_ARGS (code=exited, status=0/SUCCESS)
 Main PID: 14044 (rpcbind)
   CGroup: /system.slice/rpcbind.service
           └─14044 /sbin/rpcbind -w

May 31 10:06:05 myserver systemd[1]: Starting RPC bind service...
May 31 10:06:05 myserver systemd[1]: Started RPC bind service.

If that kind of invocation is known to cause trouble, I wonder why
RedHat/CentOS chose to make it what seems to be their default...

If you stop and restart NFS service on the server, it might start
working again.  Otherwise just reboot the nfs server.

A "systemctl stop nfs ; systemctl start nfs" was not sufficient; it only changed the symptom:
sqlite3 x.sqlite "PRAGMA case_sensitive_like=1;PRAGMA synchronous=OFF;PRAGMA recursive_triggers=ON;PRAGMA foreign_keys=OFF;PRAGMA locking_mode = NORMAL;PRAGMA journal_mode = TRUNCATE;"
Error: database is locked

On the server, at the same time, the following message is emitted to the system log:
Jun  9 12:53:57 myserver kernel: lockd: cannot monitor myclient

What did help, however, was running:
systemctl stop rpc-statd ; systemctl start rpc-statd
on the server.
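For reference, whether locking works again can also be probed directly with flock(1), without involving sqlite - a sketch (the default path is only a stand-in so it runs anywhere; point TESTFILE at a file on the affected NFS mount to actually exercise lockd/statd):

```shell
# Try to take a lock on a probe file, with a 5 second timeout so a
# broken lockd/statd does not hang the check indefinitely.
TESTFILE=${TESTFILE:-/tmp/.lock-probe}
touch "$TESTFILE"
if flock --timeout 5 "$TESTFILE" -c 'echo lock acquired'; then
    echo "locking OK"
else
    echo "locking FAILED (lockd/statd problem?)"
fi
rm -f "$TESTFILE"
```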

So thanks for your analysis! - We now know a way to remove the symptom
with relatively little disturbance of services.

Should we somehow try to get rid of that "-w" passed to rpcbind, to
avoid re-triggering the symptom at a later time?
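(One way to do that would presumably be a systemd drop-in that overrides the vendor unit's ExecStart - a sketch only, assuming the unit paths from the status output above; the empty ExecStart= line clears the vendor-defined command before redefining it:

```shell
# Create a drop-in for rpcbind.service that starts it without -w.
mkdir -p /etc/systemd/system/rpcbind.service.d
cat > /etc/systemd/system/rpcbind.service.d/no-warm-start.conf <<'EOF'
[Service]
ExecStart=
ExecStart=/sbin/rpcbind $RPCBIND_ARGS
EOF
systemctl daemon-reload
systemctl restart rpcbind
```

"systemctl edit rpcbind" would create the same kind of drop-in interactively.)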

Regards,

Lutz Vieweg
