On Wed, 4 Jun 2008 10:00:16 -0700 "David Konerding" <dakoner@xxxxxxxxx> wrote:

> >> Although we are using hard mounts, some users report that during the
> >> hammering period, some of their file operations produce "I/O Error"
> >> messages on their terminal.
> >>
> >> We checked, and the hosts are indeed using hard mounting. From our
> >> reading, I/O Errors should only ever make it back to the user if we
> >> are using soft mounting.
> >
> > hard/soft only governs what happens when there is a major timeout
> > (i.e. the server doesn't respond within a given time). If there are
> > other errors (for instance, client-side memory shortage, the server
> > starting to refuse connections, etc.), then errors can be returned
> > to the application.
>
> OK; we're already using TCP mounts, so I don't think that any new
> client->server connections should occur after the mount is established.

Unless the connection is broken for some reason and the socket has to be
reconnected.

> Second, memory is not an issue; this happens on lightly loaded clients
> with 64 Gbytes of RAM, and the RAM is all cache and buffer.

Yeah, you'd probably get a -ENOMEM or something if memory were short. I
was just offering that up as an obvious way to get errors even if you're
hard-mounting.

> > EIO is pretty generic, and is often what you see when a more obscure
> > error is translated into what a syscall would expect. It can happen
> > for other reasons besides an RPC timeout.
>
> OK, so our best bet to debug this is to:
> 1) reproduce the problem
> 2) when the problem occurs, make sure the command that got an EIO was
>    running under strace, so we know what syscall was being made
> 3) once we know what syscall was being made, backtrack to the kernel
>    source for that syscall
> 4) inspect the source to see which paths generate EIO
>
> Dave

Getting straces of the failing apps might be helpful, particularly if
it's always the same syscalls.
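As a hedged sketch of steps 1 and 2 above ('myapp' and the log path are
illustrative placeholders, not names from the thread):

```shell
# Hedged sketch: run the failing command under strace and locate the
# EIO return.  'myapp' and /tmp/myapp.strace are placeholders.
#
# In practice you would run something like:
#   strace -f -tt -o /tmp/myapp.strace myapp ...
# (-f follows forked children, -tt adds microsecond timestamps that can
# later be correlated with a packet capture.)
#
# strace reports a failing syscall in the form:
#   read(3, 0x7f1c2a000000, 4096) = -1 EIO (Input/output error)
#
# Simulate one such log line here so the grep below has input:
printf 'read(3, 0x7f1c2a000000, 4096) = -1 EIO (Input/output error)\n' \
    > /tmp/myapp.strace

# Pull out the failing calls, with line numbers, to see which syscall
# actually returned EIO:
grep -n 'EIO' /tmp/myapp.strace
```

Knowing the exact syscall (read, stat, close, etc.) narrows down which
kernel paths to inspect in step 4.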
I have a hunch, though, that you'll find yourself in the twisty maze of
RPC code. In that case, knowing the particular syscalls might not be
that informative.

Looking at network captures might also be helpful. If you can correlate
the straces with what's going over the wire, then you might be able to
determine whether this error is being generated as a result of an NFS
error from the server or something else entirely.

NFS/RPC debugging might also be helpful (see the rpcdebug manpage, and
note that it can have a significant performance impact).

-- 
Jeff Layton <jlayton@xxxxxxxxxx>

_______________________________________________
NFS maillist - NFS@xxxxxxxxxxxxxxxxxxxxx
https://lists.sourceforge.net/lists/listinfo/nfs
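To make the capture-and-correlate suggestion concrete, here is a hedged
command sketch; the interface name, server name, and file paths are
placeholders, not details from the thread, and the commands need root:

```shell
# Hedged sketch of the capture/debug steps discussed above.  All names
# below (eth0, nfsserver, /tmp/nfs.pcap) are illustrative placeholders.

# Capture NFS traffic (port 2049) for later correlation, by timestamp,
# with the strace output:
#   tcpdump -i eth0 -s 0 -w /tmp/nfs.pcap host nfsserver and port 2049

# Turn on NFS and RPC kernel debugging (very verbose; see rpcdebug(8)
# and expect a significant performance impact):
#   rpcdebug -m nfs -s all
#   rpcdebug -m rpc -s all

# ...reproduce the problem, then switch debugging back off:
#   rpcdebug -m nfs -c all
#   rpcdebug -m rpc -c all
```

The debug output lands in the kernel log, so checking dmesg around the
time of the EIO is the natural next step.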