Re: Desired RPC client behaviour on socket errors?

Jeff Layton <jeff.layton@xxxxxxxxxxxxxxx> · Fri, 1 May 2015 09:39:49 -0400

On Fri, 1 May 2015 01:22:35 -0400 (EDT)
Jamie Bainbridge <jbainbri@xxxxxxxxxx> wrote:

> Commit 3ed5e2a introduced a change to the RPC client's handling of socket return on connect.
> 
> Prior to this commit, any error return was considered instantly fatal and rpc_exit(task,-EIO) was called.
> 
> After this commit, the socket returns ECONNREFUSED, ECONNRESET, ECONNABORTED, ENETUNREACH and EHOSTUNREACH are passed back to the caller. This is a good idea and works well.
> 
> However, this commit also causes those returns to call rpc_delay(task,3*HZ) and the RPC connect to retry until the RPC times out. The timeout can be modified with soft/timeo/retrans but defaults to 3 minutes.
> 
> In practice this means that if a client tries to mount and there is a permanent network error outside the client, a TCP Reset or an ICMP error might get returned, but the mount will hang and the client will keep retrying the connect until the RPC times out. Previously a mount would fail almost straight away.
> 
> It seems 3ed5e2a solves a problem for transient network errors but creates a problem for permanent network errors.
> 
> I agree it's probably desirable for a client application (RPC in this instance) to keep trying to connect until a timeout, and it's good the timeout is configurable, but it's bad that the timeout must be tied to all RPC operations. Someone wanting a quick mount timeout must also suffer a quick NFS operation timeout, not to mention the data corruption risk that goes along with soft.
> 
> Should the RPC client call rpc_exit() on an xprt connect which returns ECONNREFUSED, ECONNRESET, ECONNABORTED, ENETUNREACH or EHOSTUNREACH, because those returns imply a "more permanent" network issue?
> 
> Disclosure: We came across this because a customer is (ab)using NFSv4 Migrations in a strange way. One server in fs_locations is firewalled behind a TCP Reset and one is not. Depending on which security zone a client is in, it can connect to one server but not the other. This enables clients in both security zones to use the same NFS mount configuration.
> 
> Cheers,
> Jamie
> --
> To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
> the body of a message to majordomo@xxxxxxxxxxxxxxx
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

I'd say no...

One thing to consider is that it's quite common for servers (Linux or
otherwise) to return those sorts of errors as they are coming up after
a reboot. Network interfaces are often brought online before the nfs
server is ready to accept connections.

If we were to change that then you'd likely see RPCs failing in those
situations, which is almost certainly not what you want.

AFAICT, mount requests should end up trying to do an rpc_ping first,
which should have RPC_TASK_SOFTCONN set. Is that not working for some
reason?
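
To make the expected behaviour concrete, here is a rough user-space
model of the connect-status decision being discussed. It is a sketch
under stated assumptions, not the kernel source: the enum and the
function connect_status_action() are hypothetical names invented for
illustration, the "softconn" flag stands in for RPC_TASK_SOFTCONN being
set on the task, and treating every other error as fatal is an
assumption of the sketch.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical outcome labels for this sketch; not kernel identifiers. */
enum connect_action {
	CONNECT_OK,          /* connection established, carry on */
	CONNECT_FAIL_NOW,    /* models rpc_exit(task, status) */
	CONNECT_DELAY_RETRY  /* models rpc_delay(task, 3*HZ), then retry */
};

/*
 * Map a connect status to an action. "softconn" models a task with
 * RPC_TASK_SOFTCONN set: such a task is expected to fail immediately
 * instead of delaying and retrying until the RPC timeout.
 */
enum connect_action connect_status_action(int status, bool softconn)
{
	switch (status) {
	case 0:
		return CONNECT_OK;
	case -ECONNREFUSED:
	case -ECONNRESET:
	case -ECONNABORTED:
	case -ENETUNREACH:
	case -EHOSTUNREACH:
		/* The five errors from the original report. */
		if (softconn)
			return CONNECT_FAIL_NOW;
		return CONNECT_DELAY_RETRY;
	default:
		/* Assumption of this sketch: anything else is fatal. */
		return CONNECT_FAIL_NOW;
	}
}
```

If the ping really does carry the soft-connect flag, a refused connect
at mount time should land in the "fail now" branch, while ordinary RPCs
on an established mount keep the delay-and-retry behaviour that protects
against servers that are still coming up.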

-- 
Jeff Layton <jeff.layton@xxxxxxxxxxxxxxx>