Re: NFSv4 mounts take longer the fail from ENETUNREACH than NFSv3 mounts.

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 




Hi all,

 I wold like to resurrect this old thread from last October, as the issue is
still a problem.

i.e. when the NFSv4 mount client in the kernel gets ENETUNREACH it doesn't
fail immediately but rather it times out and then fails.
Note that it doesn't timeout, retry, then fail.
Rather it gets the error status, waits the timeout (2 minutes), and then fails
with no retry.

The message below was the last in the thread.  There-in Trond says something
which sounds sensible and relevant, but unfortunately I completely fail to
understand the full relevance of because I don't know the ins-and-outs of the
rpc layer well enough.

Trond (or anyone else):  Could you possibly expand on this, maybe with some
pointers into the code.  I'm happy to try to come up with a fix myself, but I
don't like my chances with my current level of understanding.

Thanks,
NeilBrown




On Thu, 21 Oct 2010 16:18:18 -0400 Trond Myklebust
<Trond.Myklebust@xxxxxxxxxx> wrote:

> On Thu, 2010-10-21 at 16:08 -0400, Jeff Layton wrote:
> > On Thu, 21 Oct 2010 15:47:24 -0400
> > Trond Myklebust <Trond.Myklebust@xxxxxxxxxx> wrote:
> > 
> > > On Thu, 2010-10-21 at 15:40 -0400, Jeff Layton wrote:
> > > > On Thu, 21 Oct 2010 10:42:04 -0400
> > > > Trond Myklebust <Trond.Myklebust@xxxxxxxxxx> wrote:
> > > > 
> > > > > On Thu, 2010-10-21 at 10:31 -0400, Chuck Lever wrote:
> > > > > > On Oct 21, 2010, at 10:05 AM, Trond Myklebust wrote:
> > > > > > > It seems to me that the problem is basically that RPC_IS_SOFTCONN is
> > > > > > > poorly defined. IMO, the definition should really be that
> > > > > > > 'RPC_IS_SOFTCONN' tasks MUST exit with an error if they ever have to run
> > > > > > > call_bind or a call_connect more than once (i.e. if they ever have to
> > > > > > > loop back).
> > > > > > 
> > > > > > I think you are suggesting that we define the places where RPC_IS_SOFTCONN should be tested as exactly those places that might want to restart the RPC call via rpc_force_rebind, tk_action = call_refresh, and so on.  Yes?
> > > > > 
> > > > > Yes.
> > > > > 
> > > > > > > With that in mind, I really don't see why the RPC_IS_SOFTCONN case is
> > > > > > > being handled in call_transmit_status() instead of in call_status().
> > > > > > 
> > > > > > I tried it in call_status(), but it didn't work as expected.  My notes aren't specific about the problem.  One thing is for sure, there seems to be some redundant error handling logic in both of those states.  Unraveling it might be more than Neil bargained for, but would help "future generations."
> > > > > 
> > > > > call_transmit_status() is there basically in order to call
> > > > > xprt_end_transmit() in the cases where we want to free up the socket
> > > > > write lock (and then possibly sleep).
> > > > > 
> > > > > Actual state error handling is supposed occur in call_status().
> > > > > 
> > > > 
> > > > This is terribly confusing. So the connect handling is done in
> > > > call_transmit?
> > > 
> > > No. Connection-related _error conditions_ that result from trying to
> > > send stuff through a socket that is not connected, are handled in
> > > call_status(). Usually, by sending the process back to call_connect().
> > > 
> > 
> > Yep, I get that part...
> > 
> > > > It seems like it would make more sense to have the connect handling
> > > > mostly done in call_connect and call_connect_status, and only allow the
> > > > tasks to proceed to the call_transmit phase after the connect has
> > > > succeeded.
> > > 
> > > That is the case. However sockets can and do get closed by the _server_
> > > in unpredictable ways. This is what may need to be handled after the
> > > task has passed the call_connect state.
> > > 
> > 
> > Right, the socket can change state at any time, but the code doesn't
> > seem to do what you describe for initial connects.
> > 
> > In the EHOSTDOWN case, kernel_connect returns EINPROGRESS and then
> > xs_error_report is called with a socket error of EHOSTDOWN. That wakes
> > up the tasks with a status of EAGAIN and call_connect_status sends the
> > task to call_transmit even though the socket is still not connected.
> > 
> > Having xs_error_report set the status of all tasks with EAGAIN seems
> > wrong to me since we're essentially losing the error code that the
> > socket layer sent up.
> 
> No. The next task to try to access the socket should pick it up.
> 
> See my previous comment about ensuring that we don't try to handle
> state-related without adequate protection (i.e. locks) to ensure that
> the state being reported by the error hasn't changed.
> 
> Trond
> --
> To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
> the body of a message to majordomo@xxxxxxxxxxxxxxx
> More majordomo info at  http://vger.kernel.org/majordomo-info.html



-- 

--
To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[Index of Archives]     [Linux Filesystem Development]     [Linux USB Development]     [Linux Media Development]     [Video for Linux]     [Linux NILFS]     [Linux Audio Users]     [Yosemite Info]     [Linux SCSI]

  Powered by Linux