Using umount -f repeatedly did eventually get I/O errors back to all the reads/writes.

I understand Ric's comment about using fsync, and we do in fact use fsync at data synchronization points (close, seeks, changes from write to read, etc. -- ours is a sequential I/O application). But it is these writes and reads that end up hung most of the time, not an fsync call. I suspect that is because it is the writes that eventually fill the cache/buffers to the point where a write has to block until some block gets flushed to make room.

-----Original Message-----
From: Andrew Martin <amartin@xxxxxxxxxxx>
Date: Thu, 6 Mar 2014 09:30:21
To: <bhawley@xxxxxxxxxxx>
Cc: NeilBrown<neilb@xxxxxxx>; <linux-nfs-owner@xxxxxxxxxxxxxxx>; <linux-nfs@xxxxxxxxxxxxxxx>
Subject: Re: Optimal NFS mount options to safely allow interrupts and timeouts on newer kernels

> From: "Brian Hawley" <bhawley@xxxxxxxxxxx>
>
> I ended up writing a "manage_mounts" script, run by cron, that compares
> /proc/mounts and the fstab and uses ping and "timeout" messages in
> /var/log/messages to identify filesystems that aren't responding. It
> repeatedly does umount -f to force I/O errors back to the calling
> applications, and when it finds missing mounts (in fstab but not in
> /proc/mounts) whose servers are now pingable, it attempts to remount them.
>
> For me, timeo and retrans are necessary, but not sufficient. The chunking to
> rsize/wsize and caching play a role in how well I/O errors get relayed back
> to the applications doing the I/O.
>
> You will certainly lose data in these scenarios.
>
> It would be fantastic if somehow the timeo and retrans were sufficient (i.e.,
> when they fail, I/O errors get back to the applications that queued that I/O,
> or even the I/O that caused the application to pend because the rsize/wsize
> or cache was full).
>
> You can eliminate some of that behavior with sync/directio, but performance
> becomes abysmal.
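The actual manage_mounts script was not posted; a minimal sketch of just its fstab-vs-/proc/mounts comparison step might look like the following (function names and the recovery-loop comments are illustrative assumptions, not the poster's code):

```shell
#!/bin/sh
# Hypothetical sketch of the comparison step of a "manage_mounts"-style
# cron job: find NFS mounts listed in fstab that are no longer mounted.

# Print the mount points of NFS entries in an fstab-format file.
fstab_nfs_mounts() {
    awk '$1 !~ /^#/ && $3 ~ /^nfs/ { print $2 }' "$1"
}

# Print the mount points of NFS entries in a /proc/mounts-format file.
proc_nfs_mounts() {
    awk '$3 ~ /^nfs/ { print $2 }' "$1"
}

# Mount points listed in fstab but not currently mounted.
missing_mounts() {
    mounted=$(proc_nfs_mounts "$2")
    fstab_nfs_mounts "$1" | while read -r mp; do
        printf '%s\n' "$mounted" | grep -qx -- "$mp" || printf '%s\n' "$mp"
    done
}

# The rest of the loop (as described in the mail, not shown here): for
# mounted filesystems whose server fails ping or logs "not responding"
# in /var/log/messages, run "umount -f" repeatedly to push I/O errors
# back to blocked callers; for missing mounts whose server answers ping
# again, attempt to remount them.
```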
> > I tried "lazy". It didn't provide the desired effect (they unmounted,
> > which prevented new I/Os, but existing I/Os never got errors).

This is the problem I am having - I can unmount the filesystem with -l, but once it is unmounted the existing apache processes are still stuck forever. Does repeatedly running "umount -f" instead of "umount -l", as you describe, return I/O errors back to the existing processes and allow them to stop?

> From: "Jim Rees" <rees@xxxxxxxxx>
>
> Given this is apache, I think if I were doing this I'd use ro,soft,intr,tcp
> and not try to write anything to nfs.

I was using tcp,bg,soft,intr when this problem occurred. I do not know if apache was attempting to do a write or a read, but it seems that tcp,soft,intr was not sufficient to prevent the problem.
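For reference, an fstab line combining the options discussed in this thread might look like the following (the server, export path, and the timeo/retrans values are illustrative, not a recommendation from this thread):

```
server:/export  /mnt/data  nfs  ro,soft,intr,tcp,timeo=100,retrans=3,bg  0  0
```

Note that timeo is given in tenths of a second, and that on kernels since 2.6.25 the intr/nointr options are ignored (signals can interrupt NFS waits regardless), which may be part of why intr alone did not help here.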