On Wed, 4 Jun 2008 10:00:16 -0700 "David Konerding" <dakoner@xxxxxxxxx> wrote:

> >> Although we are using hard mounts, some users report that during the
> >> hammering period, some of their file operations produce "I/O Error"
> >> messages on their terminal.
> >>
> >> We checked, and the hosts are indeed using hard mounting. From our
> >> reading, I/O Errors should only ever make it back to the user if we
> >> are using soft mounting.
> >
> > hard/soft only governs what happens when there is a major timeout
> > (i.e. the server doesn't respond within a given time). If there are
> > other errors (for instance, client-side memory shortage, the server
> > starting to refuse connections, etc.), then errors can be returned
> > to the application.
>
> OK; we're already using TCP mounts, so I don't think that any new
> client->server connections should occur after the mount is established.

Unless the connection is broken for some reason and the socket has to be
reconnected.

> Second, memory is not an issue; this happens on lightly loaded clients
> with 64 Gbytes of RAM, and the RAM is all cache and buffer.

Yeah, you'd probably get a -ENOMEM or something if memory were short. I
was just offering that up as an obvious way to get errors even if you're
hard-mounting.

> > EIO is pretty generic, and is often what you see when a more obscure
> > error is translated into what a syscall would expect. It can happen
> > for other reasons besides an RPC timeout.
>
> OK, so our best bet to debug this is to:
> 1) reproduce the problem
> 2) when the problem occurs, make sure the command that got an EIO was
>    running under strace, so we know what syscall was being made
> 3) once we know what syscall was being made, backtrack to the kernel
>    source for that syscall
> 4) inspect the source to see which paths generate EIO
>
> Dave

Getting straces of the failing apps might be helpful, particularly if
it's always the same syscalls.
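As a hedged sketch of steps 1 and 2 above ('myapp' and the log path are
illustrative placeholders, not names from the thread):

```shell
# Hedged sketch: run the failing command under strace and locate the
# EIO return.  'myapp' and /tmp/myapp.strace are placeholders.
#
# In practice you would run something like:
#   strace -f -tt -o /tmp/myapp.strace myapp ...
# (-f follows forked children, -tt adds microsecond timestamps that can
# later be correlated with a packet capture.)
#
# strace reports a failing syscall in the form:
#   read(3, 0x7f1c2a000000, 4096) = -1 EIO (Input/output error)
#
# Simulate one such log line here so the grep below has input:
printf 'read(3, 0x7f1c2a000000, 4096) = -1 EIO (Input/output error)\n' \
    > /tmp/myapp.strace

# Pull out the failing calls, with line numbers, to see which syscall
# actually returned EIO:
grep -n 'EIO' /tmp/myapp.strace
```

Knowing the exact syscall (read, stat, close, etc.) narrows down which
kernel paths to inspect in step 4.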
I have a hunch, though, that you'll find yourself in the twisty maze of
RPC code. In that case, knowing the particular syscalls might not be
that informative.

Looking at network captures might also be helpful. If you can correlate
the straces with what's going over the wire, then you might be able to
determine whether this error is being generated as a result of an NFS
error from the server or something else entirely.

NFS/RPC debugging might also be helpful (see the rpcdebug manpage, and
note that it can have a significant performance impact).

-- 
Jeff Layton <jlayton@xxxxxxxxxx>

_______________________________________________
NFS maillist - NFS@xxxxxxxxxxxxxxxxxxxxx
https://lists.sourceforge.net/lists/listinfo/nfs
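To make the capture-and-correlate suggestion concrete, here is a hedged
command sketch; the interface name, server name, and file paths are
placeholders, not details from the thread, and the commands need root:

```shell
# Hedged sketch of the capture/debug steps discussed above.  All names
# below (eth0, nfsserver, /tmp/nfs.pcap) are illustrative placeholders.

# Capture NFS traffic (port 2049) for later correlation, by timestamp,
# with the strace output:
#   tcpdump -i eth0 -s 0 -w /tmp/nfs.pcap host nfsserver and port 2049

# Turn on NFS and RPC kernel debugging (very verbose; see rpcdebug(8)
# and expect a significant performance impact):
#   rpcdebug -m nfs -s all
#   rpcdebug -m rpc -s all

# ...reproduce the problem, then switch debugging back off:
#   rpcdebug -m nfs -c all
#   rpcdebug -m rpc -c all
```

The debug output lands in the kernel log, so checking dmesg around the
time of the EIO is the natural next step.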