Re: Optimal NFS mount options to safely allow interrupts and timeouts on newer kernels

Chuck Lever <chuck.lever@xxxxxxxxxx> · Wed, 5 Mar 2014 15:54:18 -0500

On Mar 5, 2014, at 3:15 PM, Brian Hawley <bhawley@xxxxxxxxxxx> wrote:

> 
> In my experience, I/O errors are not reported back to the read/write/close operations. I don't know for certain, but I suspect this is due to caching and to the chunking of I/O to match the rsize/wsize settings, and possibly to the fact that a peer disconnection isn't noticed unless the NFS server resets (i.e. a cable disconnection isn't sufficient).
> 
> The inability to get I/O errors back to the application has been a major pain for us.
> 
> On a lark we did find that repeated umount -f's do get I/O errors back to the application, but that isn't our preferred approach.
> 
> 
> -----Original Message-----
> From: Andrew Martin <amartin@xxxxxxxxxxx>
> Sender: linux-nfs-owner@xxxxxxxxxxxxxxx
> Date: 	Wed, 5 Mar 2014 11:45:24 
> To: <linux-nfs@xxxxxxxxxxxxxxx>
> Subject: Optimal NFS mount options to safely allow interrupts and timeouts
> on newer kernels
> 
> Hello,
> 
> Is it safe to use the "soft" mount option with proto=tcp on newer kernels (e.g.
> 3.2 and newer)? Currently using the "defaults" nfs mount options on Ubuntu
> 12.04 results in processes blocking forever in uninterruptible sleep if they
> attempt to access a mountpoint while the NFS server is offline. I would prefer
> that NFS simply return an error to the clients after retrying a few times, 
> however I also cannot have data loss. From the man page, I think these options
> will give that effect?
> soft,proto=tcp,timeo=10,retrans=3
> 
> From my understanding, this will cause NFS to retry the connection 3 times (once
> per second), and then if all 3 are unsuccessful return an error to the
> application. Is this correct? Is there a risk of data loss or corruption by
> using "soft" in this way? Or is there a better way to approach this?

There is always a silent data corruption risk with "soft." Using TCP and a long retransmit timeout mitigates the risk, but it is still there. A one-second timeout for TCP is very short and will almost certainly cause trouble, especially if the server or network is slow.
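
To put rough numbers on the timeout, here is a minimal sketch in Python. It assumes the behavior described in nfs(5): timeo is given in tenths of a second, and over TCP the client backs off linearly, adding timeo after each retransmission up to a 600-second cap. The function name soft_timeout_schedule is made up for illustration, and how many attempts a given kernel really makes before a soft mount returns an error may differ, so treat the totals as estimates only.

```python
def soft_timeout_schedule(timeo_ds, retrans, max_timeout_s=600.0):
    """Estimate the wait (in seconds) before each of `retrans` timeouts.

    Assumption: linear backoff as described in nfs(5), where the
    timeout grows by `timeo` after every retransmission and is capped
    at `max_timeout_s`. `timeo_ds` is in deciseconds, as in the
    timeo= mount option.
    """
    base_s = timeo_ds / 10.0
    return [min(base_s * attempt, max_timeout_s)
            for attempt in range(1, retrans + 1)]


# Options from the original question: timeo=10,retrans=3
proposed = soft_timeout_schedule(timeo_ds=10, retrans=3)
print("timeo=10:", proposed, "total", sum(proposed), "s")

# Same retrans with the TCP default timeo of 600 (60 seconds)
longer = soft_timeout_schedule(timeo_ds=600, retrans=3)
print("timeo=600:", longer, "total", sum(longer), "s")
```

Under these assumptions the proposed options give up after only a few seconds of server silence, while a 60-second timeo leaves several minutes for the server or network to recover.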

You should be able to ^C any waiting NFS process. Blocking forever is usually the sign of a bug.

In general, NFS is not especially tolerant of server unavailability. You may want to consider some other distributed file system protocol that is more fault-tolerant, or find ways to ensure your NFS servers are always accessible.

--
Chuck Lever
chuck[dot]lever[at]oracle[dot]com
