Using umount -f repeatedly did eventually get I/O errors back to all the reads/writes.

I understand Ric's comment about using fsync, and we do in fact use fsync at data synchronization points (close, seeks, changes from write to read, etc. -- ours is a sequential I/O application). But it is these writes and reads that end up hung most of the time, not an fsync call. I suspect that is because it is the writes that eventually fill the cache/buffers to the point where a write has to block until some block gets flushed to make room.

-----Original Message-----
From: Andrew Martin <amartin@xxxxxxxxxxx>
Date: Thu, 6 Mar 2014 09:30:21
To: <bhawley@xxxxxxxxxxx>
Cc: NeilBrown<neilb@xxxxxxx>; <linux-nfs-owner@xxxxxxxxxxxxxxx>; <linux-nfs@xxxxxxxxxxxxxxx>
Subject: Re: Optimal NFS mount options to safely allow interrupts and timeouts on newer kernels

> From: "Brian Hawley" <bhawley@xxxxxxxxxxx>
>
> I ended up writing a "manage_mounts" script, run by cron, that compares
> /proc/mounts and the fstab and uses ping and "timeout" messages in
> /var/log/messages to identify filesystems that aren't responding. It
> repeatedly does umount -f to force I/O errors back to the calling
> applications, and when it finds missing mounts (in fstab but not in
> /proc/mounts) whose servers are now pingable, it attempts to remount them.
>
> For me, timeo and retrans are necessary, but not sufficient. The chunking to
> rsize/wsize and caching play a role in how well I/O errors get relayed back
> to the applications doing the I/O.
>
> You will certainly lose data in these scenarios.
>
> It would be fantastic if somehow the timeo and retrans were sufficient (i.e.,
> when they fail, I/O errors get back to the applications that queued that I/O,
> or even the I/O that caused the application to pend because the rsize/wsize
> or cache was full).
>
> You can eliminate some of that behavior with sync/directio, but performance
> becomes abysmal.
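The actual manage_mounts script was not posted; a minimal sketch of just its fstab-vs-/proc/mounts comparison step might look like the following (function names and the recovery-loop comments are illustrative assumptions, not the poster's code):

```shell
#!/bin/sh
# Hypothetical sketch of the comparison step of a "manage_mounts"-style
# cron job: find NFS mounts listed in fstab that are no longer mounted.

# Print the mount points of NFS entries in an fstab-format file.
fstab_nfs_mounts() {
    awk '$1 !~ /^#/ && $3 ~ /^nfs/ { print $2 }' "$1"
}

# Print the mount points of NFS entries in a /proc/mounts-format file.
proc_nfs_mounts() {
    awk '$3 ~ /^nfs/ { print $2 }' "$1"
}

# Mount points listed in fstab but not currently mounted.
missing_mounts() {
    mounted=$(proc_nfs_mounts "$2")
    fstab_nfs_mounts "$1" | while read -r mp; do
        printf '%s\n' "$mounted" | grep -qx -- "$mp" || printf '%s\n' "$mp"
    done
}

# The rest of the loop (as described in the mail, not shown here): for
# mounted filesystems whose server fails ping or logs "not responding"
# in /var/log/messages, run "umount -f" repeatedly to push I/O errors
# back to blocked callers; for missing mounts whose server answers ping
# again, attempt to remount them.
```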
> > I tried "lazy". It didn't provide the desired effect (they unmounted,
> > which prevented new I/Os, but existing I/Os never got errors).

This is the problem I am having - I can unmount the filesystem with -l, but once it is unmounted the existing apache processes are still stuck forever. Does repeatedly running "umount -f" instead of "umount -l", as you describe, return I/O errors back to the existing processes and allow them to stop?

> From: "Jim Rees" <rees@xxxxxxxxx>
>
> Given this is apache, I think if I were doing this I'd use ro,soft,intr,tcp
> and not try to write anything to nfs.

I was using tcp,bg,soft,intr when this problem occurred. I do not know if apache was attempting to do a write or a read, but it seems that tcp,soft,intr was not sufficient to prevent the problem.
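For reference, an fstab line combining the options discussed in this thread might look like the following (the server, export path, and the timeo/retrans values are illustrative, not a recommendation from this thread):

```
server:/export  /mnt/data  nfs  ro,soft,intr,tcp,timeo=100,retrans=3,bg  0  0
```

Note that timeo is given in tenths of a second, and that on kernels since 2.6.25 the intr/nointr options are ignored (signals can interrupt NFS waits regardless), which may be part of why intr alone did not help here.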